1. Introduction

A. Background and Motivation 

This paper is motivated by problems of multi-agent decision-making in dynamic and uncertain
environments. The basic setup consists of a network of agents and a controlled global state process or
signal (a finite state Markov chain with controlled transitions). The state process is actuated by a remote 
controller whose actions and the resulting controlled state influence the statistical distribution of the 
random instantaneous costs incurred at the agents. The Markov decision process (MDP) that we consider
pertains to collaborative welfare: the agent network is interested in obtaining the optimal
stationary control strategy that minimizes the network-averaged infinite horizon discounted cost. Our
multi-agent setup, for instance, resembles that of a thermostatically controlled smart building, in which 
the global state represents environmental dynamics affecting the spatial temperature distribution and the 
agents correspond to sensors distributed throughout the building. In this application, the objective of the
building thermostatic controller may be of the reference tracking form, for example, to minimize
the average of the squared deviations of the measured temperatures at the sensing locations from a
desired reference value. Note that the term agent is used generically here, and its
scope varies from one application to another. As another example, in which the agents correspond to
social or organizational entities, consider a financial market setting. Here, the global signal may often be 
related to the dynamic market interest rate affecting, for example, the investment patterns of the agents, 
in which case the economic policies (actions) of the regulator (controller) may be shaped by the welfare 
motive to sustain overall economic growth. The scope of our formulation is not limited to the above
examples; practical scenarios that motivate our setup abound, ranging from large-scale load control for
efficient demand-side management in energy networks [1] to collaborative decision-making in multi-agent
robotic networks [2], [3].

Reinforcement learning, of which Q-learning [4], [5], [6] is an instance, has proved to be a valuable,
practically applicable solution methodology for MDPs in scenarios involving lack of prior information
on the problem statistics, which include the transition behavior of the controlled state process and, in our
multi-agent setting, the statistical distributions of the agents' instantaneous costs (generally varying from
one agent to the other). Based on a reformulation of the Bellman equation, the class of Q-learning methods
generates sequential (stochastic) approximations of the value function using instantiations of state-action
trajectories, as opposed to relying on exact problem statistics. The state-action trajectory instantiations
for value function learning may correspond to online real-time data obtained while implementing the
control (for example, [5]), in which case the resulting Q-learning methods are, in fact, instances of direct
adaptive control [7], or may correspond to training data obtained through simulated state-action responses
(see [8] for various exploration methods). However, a direct application of the above classical reinforcement
learning techniques to our proposed multi-agent setting with possibly geographically distributed agents
would require a centralized computing architecture having access
to the instantaneous one-stage costs of all the agents at all times (see Section 2 for a more detailed and
formal discussion). Since the instantaneous one-stage costs may only be observed locally at the agents,
this, in turn, requires each network agent to forward its one-stage cost to the remote central location at
all times, which may not be feasible due to limited energy resources at the agents and a bit-budgeted
communication medium. This motivates us to consider a fully distributed alternative, QD-learning,
in which the agents participate in autonomous in-network learning by means of local computation and
communication over a sparse, possibly time-varying communication network.

There has been extensive research on multi-agent reinforcement learning (see [9], [10] for surveys).
Various formulations, ranging from general competitive dynamic stochastic games [11], [12], [13], [14]
to so-called fully cooperative ones [15], [16], [17], [18], [19], [20], [21], [22], have been investigated (see [10]
for a more complete taxonomy). From the network objective viewpoint, the fully cooperative formulations 
are somewhat similar in spirit to our setup, in that both consider the optimization of a unique global 
quantity (the one-stage global cost corresponding to the network average of the random one-stage local 
agent costs in the current setting) - the key difference being that in the current formulation we impose 
the additional constraint that the instantaneous random realizations of the one-stage global costs are not 
directly observable at the agents. More specifically, at a given time instant, each agent has access to 
its local instantaneous one-stage cost only and not their network average; whereas (often by problem 
definition), the fully cooperative formulations mentioned above (see also [23] for several decentralized
variants) assume that the global one-stage costs are available at the agents at all times. Although the
aforementioned approaches are not directly comparable, as they often involve decentralized actuation at the
agent level as opposed to a remote process controller in our framework, we emphasize that, in the current
context, they would require the network average of the local instantaneous one-stage costs to be available
at all agents at all times, which, given that the agents may be geographically distributed, would correspond
to all-to-all agent communication at all times. What distinguishes our proposed distributed approach from the
existing literature is that we consider a fully distributed setting in which the agents disseminate the locally
sensed costs through mutual neighborhood communication over a (pre-specified) sparse communication
graph.

However, we also point out two aspects of generic multi-agent reinforcement learning that are not 
addressed by the current formulation, namely, that of partial state observation and decentralized actuation. 
Specifically, we assume that each network agent can perfectly access or observe the global state. Moreover, 
in contrast to setups with local decentralized agent actuations, our framework applies to models in which 
the control actions are generated by a remote (global) controller. Further, we assume that the remote
control actions are perfectly known at the agents,¹ which could be a limitation in some applications.

Our distributed approach is of the consensus + innovations type [24], in which the agents simultaneously
incorporate the information received from their communicating neighbors and the instantaneous
locally sensed costs in the same update rule (see also [25], [26], [27], [28], and [29] for related literature).
As such, the resulting value function update processes at the agents are mixed time-scale, in which the
distinct potentials of consensus (corresponding to information mixing through neighborhood communication
[30], [31], [32], [29], [33], [34], [35]) and local innovation (corresponding to the instantaneous
locally sensed one-stage cost) are traded off appropriately. Without inter-agent communication (the
consensus potential), the locally sensed one-stage costs at the agents are not sufficient to provide an
observable approximation of the desired global cost functional. On the other hand, given that the
inter-agent communication is not all-to-all, exact reconstruction of the instantaneous global one-stage cost is not
possible, and, hence, it is imperative to appropriately balance the two potentials so that in the long term the
network information diffuses sufficiently to guarantee asymptotic global cost observability at the agents.
By suitably designing the time-varying weight sequences associated with the consensus and innovation
potentials, we show that QD-learning achieves optimal learning performance asymptotically, i.e., the
network agents reach consensus on the desired value function and the corresponding optimal stationary
control strategy, under minimal connectivity assumptions on the underlying communication graph (see
Section 3 for details). Similar to direct adaptive control formulations (see, for example, [7]), we allow
generic statistical dependence on the state-action trajectories (processes) that drive the learning, which,
in turn, in our distributed setting, leads to mixed time-scale stochastic evolutions that are non-Markovian.
The analysis methods developed in the paper are of independent interest, and we expect our techniques to
be applicable to broader classes of distributed information processing and control problems with memory.
From a technical viewpoint, in centralized or single-agent operation scenarios, the connection between
Q-learning and stochastic approximation was made explicit in [5]. In this paper, we develop a distributed
generalization of Q-learning, QD-learning, along the lines of consensus and innovations, thus extending
the above connection to distributed multi-agent scenarios.

On another note, the work in this paper is also related to problems of distributed optimization in
multi-agent networks. The existing literature on distributed optimization (see, for example, [36], [37], [38],
[39], [40]) mostly considers static scenarios, in which, broadly, the network goal is to minimize the sum
(or average) of static (deterministic) local objectives, with each agent only aware of its local objective
function. Our formulation and results may be viewed as an extension of the above to dynamic uncertain
scenarios, in which the environmental dynamics are modeled as a finite-state Markov chain, and, instead
of optimizing over a static variable, the agents are interested in obtaining a control policy that minimizes
a long-term running cost. Further, in contrast to the static distributed optimization scenarios, the current
formulation assumes no prior information on the statistics of the local one-stage costs and the transition
probabilities of the controlled state process; instead, it learns them from sequentially sensed data (costs).

¹Note that, for our setup involving a remote controller, the assumption that each network agent has access to the control actions
may not be restrictive for applications of interest. For instance, the control actions or decisions of the remote global policy
maker (controller) may often be directly observable to the network entities; for example, in financial or social network applications,
the market or network entities are typically informed about the policies of the global welfare organization. In situations where
such direct observability is not possible, the control information may often be disseminated through network-wide broadcasts by
the remote controller; given that the remote controller, being a global entity, has sufficient energy resources and that the action
broadcasts are finite-bit (due to the finiteness of the action space), the action observability assumption may not be restrictive in
many scenarios of interest.

The rest of the paper is organized as follows. Section 1-B sets notation to be used in the sequel.
The multi-agent learning setup is formulated in Section 2. Section 3 presents the proposed distributed
version of Q-learning, QD-learning, in which we also formalize our assumptions on the system model
and inter-agent communication. Intermediate results on the properties of distributed and mixed time-scale
stochastic recursions are presented in Section 4, whereas Section 5 is devoted to the convergence analysis
of QD-learning and the proof of the main result of the paper as stated in Section 3. Simulation studies
comparing the convergence rate of the proposed distributed learning scheme with that of centralized
Q-learning are presented in Section 6. Finally, Section 7 concludes the paper and discusses avenues for
further research.

B. Notation 

We denote the $k$-dimensional Euclidean space by $\mathbb{R}^k$. The set of reals is denoted by $\mathbb{R}$, whereas $\mathbb{R}_+$
denotes the non-negative reals. The partial order on $\mathbb{R}^k$ induced by component-wise ordering will be
denoted by $\leq_c$, i.e., for $x$ and $y$ in $\mathbb{R}^k$, the notation $x \leq_c y$ will be used to indicate that each component
of $x$ is less than or equal to the corresponding component of $y$. The set of $k \times k$ real matrices is denoted by $\mathbb{R}^{k \times k}$.
The corresponding subspace of symmetric matrices is denoted by $\mathbb{S}^k$. The cone of positive semidefinite
matrices is denoted by $\mathbb{S}^k_+$, whereas $\mathbb{S}^k_{++}$ denotes the subset of positive definite matrices. The $k \times k$
identity matrix is denoted by $I_k$, while $\mathbf{1}_k$ and $\mathbf{0}_k$ denote respectively the column vectors of ones and
zeros in $\mathbb{R}^k$. Often the symbol $0$ is used to denote the $k \times p$ zero matrix, the dimensions being clear
from the context. The operator $\|\cdot\|$ applied to a vector denotes the standard Euclidean $\ell_2$ norm, while
applied to matrices it denotes the induced $\ell_2$ norm, which is equivalent to the matrix spectral radius for
symmetric matrices. The $\ell_\infty$ norm for vectors and matrices is denoted by $\|\cdot\|_\infty$. For a matrix $A \in \mathbb{S}^k$, the
ordered eigenvalues will be denoted by $\lambda_1(A) \leq \lambda_2(A) \leq \cdots \leq \lambda_k(A)$. The notation $A \otimes B$, whenever
applicable, is used to denote the Kronecker product of matrices $A$ and $B$.

Time is assumed to be discrete or slotted throughout the paper. We reserve the symbols $t$ and $s$ to
denote time, $\mathbb{T}_+$ denoting the discrete index set $\{0, 1, 2, \cdots\}$.

Throughout, we will assume the existence of a probability space $(\Omega, \mathcal{F})$ that is rich enough to support
all the proposed random objects. For an event $B \in \mathcal{F}$, the notation $\mathbb{I}(B)$ will be used to denote the
corresponding indicator random variable, i.e., $\mathbb{I}(B)$ takes the value one on the event $B$ and zero otherwise.
Probability and expectation on $(\Omega, \mathcal{F})$ will be denoted by $\mathbb{P}(\cdot)$ and $\mathbb{E}[\cdot]$, respectively. All inequalities
involving random objects are to be interpreted almost surely (a.s.), unless stated otherwise.

Spectral graph theory: The inter-agent communication topology may be described by an undirected
graph $G = (V, E)$, with $V = [1 \cdots N]$ and $E$ denoting the set of agents (nodes) and communication links
(edges), respectively. The unordered pair $(n, l) \in E$ if there exists an edge between nodes $n$ and $l$. We
only consider simple graphs, i.e., graphs devoid of self-loops and multiple edges. A graph is connected
if there exists a path² between each pair of nodes. The neighborhood of node $n$ is

$$\Omega_n = \{ l \in V \mid (n, l) \in E \}.$$

Node $n$ has degree $d_n = |\Omega_n|$ (the number of edges with $n$ as one end point). The structure of the graph can
be described by the symmetric $N \times N$ adjacency matrix $A = [A_{nl}]$, with $A_{nl} = 1$ if $(n, l) \in E$ and $A_{nl} = 0$
otherwise. Let the degree matrix be the diagonal matrix $D = \mathrm{diag}(d_1 \cdots d_N)$. By definition, the positive
semidefinite matrix $L = D - A$ is called the graph Laplacian matrix. The eigenvalues of $L$ can be ordered
as $0 = \lambda_1(L) \leq \lambda_2(L) \leq \cdots \leq \lambda_N(L)$, the eigenvector corresponding to $\lambda_1(L)$ being $(1/\sqrt{N}) \mathbf{1}_N$. The
multiplicity of the zero eigenvalue equals the number of connected components of the network; for a
connected graph, $\lambda_2(L) > 0$. This second eigenvalue is the algebraic connectivity or the Fiedler value of
the network; see [41], [42] for a detailed treatment of graphs and their spectral theory.
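
As a small numerical illustration of the Laplacian facts above, the following Python sketch builds $L = D - A$ from an edge list and estimates $\lambda_2(L)$ by a power iteration restricted to the subspace orthogonal to the all-ones vector. The function names, the shift constant, the start vector, and the iteration count are illustrative assumptions and are not taken from the paper; the estimate also presumes that the start vector is not orthogonal to the eigenspace of $\lambda_2(L)$.

```python
import math

def laplacian(num_nodes, edges):
    """Return L = D - A (as a list of lists) for a simple undirected graph."""
    L = [[0.0] * num_nodes for _ in range(num_nodes)]
    for n, l in edges:
        L[n][l] -= 1.0
        L[l][n] -= 1.0
        L[n][n] += 1.0
        L[l][l] += 1.0
    return L

def fiedler_value(L, iters=500):
    """Estimate lambda_2(L) via power iteration on (c*I - L), after projecting
    out the all-ones eigenvector of lambda_1(L) = 0. Assumes N >= 2."""
    N = len(L)
    # Gershgorin: lambda_N(L) <= 2 * max degree, so c*I - L is positive definite.
    c = 2.0 * max(L[n][n] for n in range(N)) + 1.0
    v = [float(n) for n in range(N)]          # deterministic start vector (assumption)
    for _ in range(iters):
        mean = sum(v) / N
        v = [x - mean for x in v]             # remove the component along the ones vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        v = [c * v[n] - sum(L[n][m] * v[m] for m in range(N)) for n in range(N)]
    mean = sum(v) / N
    v = [x - mean for x in v]
    Lv = [sum(L[n][m] * v[m] for m in range(N)) for n in range(N)]
    return sum(a * b for a, b in zip(v, Lv)) / sum(a * a for a in v)  # Rayleigh quotient

path = laplacian(3, [(0, 1), (1, 2)])   # connected three-node path
split = laplacian(3, [(0, 1)])          # node 2 is isolated (two components)
print(fiedler_value(path), fiedler_value(split))
```

For the three-node path the exact Laplacian eigenvalues are 0, 1, and 3, so the first estimate should be close to 1, while the disconnected example has a repeated zero eigenvalue and the second estimate should be close to 0.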

2. System Model 

Let $\{\mathbf{x}_t\}$ be a controlled Markov chain taking values in a finite state space $\mathcal{X} = [1, \cdots, M]$. Denoting
by $\mathcal{U}$ the (finite) set of control actions $u$, we assume³ that the state transition is governed by

$$\mathbb{P}(\mathbf{x}_{t+1} = j \mid \mathbf{x}_t = i, \mathbf{u}_t = u) = p^u_{i,j}$$

for every $i, j \in \mathcal{X}$ and $u \in \mathcal{U}$, where the state transition probabilities satisfy $\sum_{j \in \mathcal{X}} p^u_{i,j} = 1$ for all $i \in \mathcal{X}$.

We further assume that there are $N$ agents, with agent $n$ incurring a random one-stage cost⁴ $c_n(i, u)$
whenever control $u$ is applied at state $i$. For a stationary control policy $\pi$, i.e., where $\{\mathbf{u}_t\}$ satisfies
$\mathbf{u}_t = \pi(\mathbf{x}_t)$ for some function $\pi : \mathcal{X} \to \mathcal{U}$, the state process $\{\mathbf{x}^\pi_t\}$ (the superscript $\pi$ is used to indicate
the dependence on the control policy $\pi$) evolves as a homogeneous Markov chain with⁵

$$\mathbb{P}(\mathbf{x}^\pi_{t+1} = j \mid \mathbf{x}^\pi_t = i) = p^{\pi(i)}_{i,j}.$$

²A path between nodes $n$ and $l$ of length $m$ is a sequence $(n = i_0, i_1, \cdots, i_m = l)$ of vertices, such that $(i_k, i_{k+1}) \in E$ for
all $0 \leq k \leq m - 1$.

³The letters $i$ and $j$ will be reserved mostly to denote a generic element of the state space $\mathcal{X}$, whereas $u$ will denote a generic
element of the control space $\mathcal{U}$. Also, note that the state and control stochastic processes are denoted by bold symbols, $\{\mathbf{x}_t\}$
and $\{\mathbf{u}_t\}$ respectively, although they assume a finite number of values only.

⁴Note that the instantaneous costs $c_n(\cdot)$ depend only on the current state of the process and the control applied, but not on
the successor state as is the case with some control problems. However, the latter formulations may often be reduced to the
former (i.e., current state and control dependence only) by proper state augmentation.

For a stationary policy $\pi$ and initial state $i$ of the process $\{\mathbf{x}^\pi_t\}$, the infinite horizon discounted cost is
given by

$$V^\pi_i = \limsup_{T \to \infty} \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t c_n(\mathbf{x}^\pi_t, \pi(\mathbf{x}^\pi_t)) \,\Big|\, \mathbf{x}^\pi_0 = i \right],$$

where $0 < \gamma < 1$ is the discounting factor. Note that the cost $V^\pi_i$, defined as such, is a global (centralized)
cost, as it involves the one-stage costs of all the agents. The Markov decision problem (MDP) that we
consider in this paper concerns the evaluation of the optimal infinite horizon discounted cost

$$V^*_i = \inf_{\pi} V^\pi_i \qquad (1)$$

and the associated stationary policy $\pi^*$, provided the latter exists.

Let $V^* \in \mathbb{R}^M$ denote $[V^*_1, \cdots, V^*_M]^T$. Denote by $\mathcal{T} : \mathbb{R}^M \to \mathbb{R}^M$ the (centralized) dynamic
programming operator with

$$\mathcal{T}_i(V) = \min_{u \in \mathcal{U}} \left\{ \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[c_n(i, u)] + \gamma \sum_{j \in \mathcal{X}} p^u_{i,j} V_j \right\}, \qquad (2)$$

$\mathcal{T}_i(\cdot)$ denoting the $i$-th component functional of $\mathcal{T}(\cdot)$, such that $\mathcal{T}(V) = [\mathcal{T}_1(V), \cdots,
\mathcal{T}_M(V)]^T$ for each $V \in \mathbb{R}^M$. The Bellman equation [43] asserts that $V^*$ is a fixed point of $\mathcal{T}(\cdot)$, i.e.,
$\mathcal{T}(V^*) = V^*$. Further, for discounting factors $\gamma$ that are strictly less than one, it may be readily seen [43]
that the dynamic programming operator $\mathcal{T}(\cdot)$ is a strict contraction, thus implying the value function $V^*$
to be its unique fixed point. As such, starting with an arbitrary initial approximation $V_0 \in \mathbb{R}^M$, one
obtains a sequence of iterates $\{V_t\}$ of $\mathcal{T}(\cdot)$, with $V_t = \mathcal{T}^t(V_0)$, such that $V_t \to V^*$ as $t \to \infty$.
The above iterative construction forms the basis of classical value iteration methods for evaluating the
desired value function $V^*$ (and hence the corresponding optimal policy $\pi^*(\cdot)$), at least for the considered
scenario with $\gamma < 1$. However, in doing so, i.e., in constructing successive iterates of $\mathcal{T}(\cdot)$, the value
iteration techniques assume that the problem statistics (the expected one-stage costs and the state transition
probabilities $p^u_{i,j}$) are perfectly known a priori.
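
To make the role of the operator in (2) concrete, the following Python sketch applies it repeatedly on a tiny problem with known statistics, which is exactly the information that the learning methods discussed next do not assume. The data layout (`P[u][i][j]`, `mean_cost[n][i][u]`), the function names, the stopping tolerance, and all numerical values are hypothetical choices made for illustration only.

```python
def bellman_operator(V, P, mean_cost, gamma):
    """Apply the dynamic programming operator of (2) to the vector V.

    P[u][i][j]         : transition probability from state i to j under action u
    mean_cost[n][i][u] : expected one-stage cost of agent n at state i, action u
    """
    num_agents, num_states, num_actions = len(mean_cost), len(V), len(P)
    TV = []
    for i in range(num_states):
        candidates = []
        for u in range(num_actions):
            avg_cost = sum(mean_cost[n][i][u] for n in range(num_agents)) / num_agents
            expected_next = sum(P[u][i][j] * V[j] for j in range(num_states))
            candidates.append(avg_cost + gamma * expected_next)
        TV.append(min(candidates))
    return TV


def value_iteration(P, mean_cost, gamma, tol=1e-10, max_iter=100_000):
    """Iterate V <- T(V) from V = 0 until successive iterates differ by less than tol."""
    V = [0.0] * len(P[0])
    for _ in range(max_iter):
        TV = bellman_operator(V, P, mean_cost, gamma)
        if max(abs(a - b) for a, b in zip(TV, V)) < tol:
            return TV
        V = TV
    return V


# Hypothetical 2-state, 2-action, 2-agent problem (all numbers are made up).
P = [[[0.9, 0.1], [0.2, 0.8]],            # action 0
     [[0.5, 0.5], [0.6, 0.4]]]            # action 1
mean_cost = [[[1.0, 2.0], [4.0, 0.5]],    # agent 0
             [[3.0, 0.0], [2.0, 1.5]]]    # agent 1
V_star = value_iteration(P, mean_cost, gamma=0.9)
print(V_star)
```

Because the operator is a contraction with modulus $\gamma$ in the $\ell_\infty$ norm, the returned vector is an approximate fixed point regardless of the initial guess; the zero initialization here is only a convenience.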

Reinforcement learning methods are motivated by scenarios involving lack of information about the
problem statistics. Based on a reformulation of the Bellman equation, $\mathcal{T}(V^*) = V^*$, the class of
Q-learning methods generates sequential (stochastic) approximations of the value function⁶ using
instantiations of state-action trajectories, as opposed to relying on exact problem statistics. The state-action
trajectory instantiations for value function learning may correspond to online real-time data obtained
while implementing the control, in which case the resulting Q-learning methods are, in fact, instances
of direct adaptive control [7], or may correspond to offline training data obtained through simulated
state-action responses. As far as analysis is concerned, the former subsumes the latter, as trajectories that
are obtained in the process of real-time control implementation incur temporal statistical dependencies
due to memory in the sequential control selection task. While the Q-learning techniques discussed above
are appealing as they relax the requirement of prior system model knowledge, in the context of our
multi-agent setting, they rely on a centralized architecture that requires the instantaneous agent one-stage
costs $c_n(\mathbf{x}_t, \mathbf{u}_t)$ (for each network agent $n$) to be available at a centralized computing resource at all times
$t$ with a view to obtaining an approximation of the sum of expectations in (2). Since the instantaneous
one-stage costs may only be observed at the agents, this, in turn, requires each network agent to transmit
its one-stage cost to the remote central location at all times, which may not be feasible due to limited
energy resources at the agents and a bit-budgeted communication medium. This motivates us to consider
a fully distributed alternative, in which the agents autonomously engage in the learning process through
collaborative local communication and computation.

⁵Note that, in general, the set of actions $\mathcal{U}$ is state-dependent, which can be accommodated in our formulation by redefining
$\mathcal{U}$ to be the union of all state-dependent action sets and modifying the one-stage costs appropriately.

⁶To be precise, as will be shown later, instead of generating successive approximations of the state-value function $V^*_i$, $i \in \mathcal{X}$,
Q-learning methods generate approximations of the so-called state-action value functions $Q^*_{i,u}$, $(i, u) \in \mathcal{X} \times \mathcal{U}$ (often known
as the Q-matrices or factors), from which the desired value functions may be recovered.

3. QD-Learning: Distributed Collaborative Q-Learning

In this section, we present a distributed scheme for multi-agent Q-learning, QD-learning. Like its
centralized counterpart, QD-learning is based on instantiations of state-action trajectories. In general, the
state-action trajectories are sample paths of stochastic processes $\{\mathbf{x}_t\}$ and $\{\mathbf{u}_t\}$ taking values in $\mathcal{X}$ and
$\mathcal{U}$, respectively. In addition, we have the local one-stage cost processes, $\{c_n(\mathbf{x}_t, \mathbf{u}_t)\}$ for each agent $n$,
as a result of the randomly generated actions $\mathbf{u}_t$ and states $\mathbf{x}_t$, that are accessible to the corresponding
agents. The goal of the QD-learning scheme is to ensure that each agent eventually learns the value function
$V^*$ based on the stochastic processes $\{\mathbf{x}_t\}$, $\{\mathbf{u}_t\}$, and the one-stage cost processes. To formalize the
distributed agent learning, we impose the following measurability requirements that characterize the
locally accessible agent information over time for decision-making.

(M.1): There exists a complete probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with a filtration $\{\mathcal{F}_t\}$, such that the state
and control processes, $\{\mathbf{x}_t\}$ and $\{\mathbf{u}_t\}$, respectively, are adapted to $\mathcal{F}_t$. The conditional probability law
governing the controlled transitions of $\{\mathbf{x}_t\}$ satisfies

$$\mathbb{P}(\mathbf{x}_{t+1} = j \mid \mathcal{F}_t) = p^{\mathbf{u}_t}_{\mathbf{x}_t, j}, \qquad (3)$$

and we require for each $n$

$$\mathbb{E}[c_n(\mathbf{x}_t, \mathbf{u}_t) \mid \mathcal{F}_t] = \mathbb{E}[c_n(\mathbf{x}_t, \mathbf{u}_t) \mid \mathbf{x}_t, \mathbf{u}_t], \qquad (4)$$

which equates to $\mathbb{E}[c_n(i, u)]$ on the event $\{\mathbf{x}_t = i, \mathbf{u}_t = u\}$, i.e., conditioned on the current state-action
pair, the one-stage random costs are independent of $\mathcal{F}_t$. Further, we assume that the random cost
$c_n(\mathbf{x}_t, \mathbf{u}_t)$ is adapted to $\mathcal{F}_{t+1}$ for each $t$. Note that (3)-(4) is a formal restatement of the fact that the
online state-action trajectories and the associated costs that are to be used for value function learning
satisfy the controlled Markov transitions in accordance with the MDP. The obvious choice of such a
filtration would be the natural one induced by the processes,⁷ i.e.,

$$\mathcal{F}_t = \sigma\left( \{\mathbf{x}_s, \mathbf{u}_s\}_{s \leq t}, \{c_n(\mathbf{x}_s, \mathbf{u}_s)\}_{1 \leq n \leq N,\, s < t} \right), \qquad (5)$$

provided the state-actions are generated according to the given MDP dynamics.

Also, we assume that the one-stage random costs possess super-quadratic moments, i.e., in particular,
we assume there exists a constant $\varepsilon_1 > 0$ (which could be arbitrarily small) such that

$$\mathbb{E}\left[ c_n^{2 + \varepsilon_1}(i, u) \right] < \infty \qquad (6)$$

for all $n$, $i$, and $u$.

Note that $\mathcal{F}_t$, as defined above, represents the global network information at each time instant $t$. In
the sequel, we will also need to characterize the local information $\mathcal{F}_n(t)$ available at each agent $n$ at
time $t$ on which the agent's instantaneous local decision-making is based. The local information at an
agent $n$ reflects its locally sensed cost data and the messages or information it obtains from its neighbors
over time, among other locally observed variables, such as the instantaneous state and control data. To
formalize, let $m_{n,l}(t)$ denote the message that agent $n$ obtains from its neighbor $l \in \Omega_n(t)$ at time $t$,
where $\Omega_n(t)$ denotes the time-varying (possibly stochastic) communication neighborhood of agent $n$ at
time $t$. The local information $\mathcal{F}_n(t)$ at agent $n$ at time $t$ (up to the beginning of time slot $t + 1$) is then
formally represented by the $\sigma$-algebra

Further, for the inter-agent message exchange process to be consistent with respect to (w.r.t.) the local
information sequences $\{\mathcal{F}_n(t)\}$, we require

$$m_{l,n}(t) \in \mathcal{F}_n(t) \qquad (8)$$

for each pair of agents $(n, l)$ such that $n \in \Omega_l(t)$, at all times $t$. The key difference between the global
network information $\mathcal{F}_t$ (as would be available to a fictitious center for decision-making) and the local
agent information $\mathcal{F}_n(t)$ is in terms of accessibility of the reward information: the latter consists of only
the locally sensed reward data, whereas the former involves the sum-total network reward information
from all agents at all times. The lack of global information at the local agent level justifies the need for
collaboration, in which the agents engage in mutual neighborhood message exchanges with a view to
eventually disseminating the required reward statistics across the network. With the above formalism of
distributed collaboration, in particular (7)-(8), it is readily seen that (as expected),

$$\mathcal{F}_t = \bigvee_{n=1}^{N} \mathcal{F}_n(t),$$

where $\vee$ denotes the 'join' of $\sigma$-algebras, i.e., the global information at an instant $t$ is the sum-total
of the local agent information, provided $\mathcal{F}_t$ corresponds to the natural filtration induced by the
state-action pairs and the instantaneous rewards as in (5). Moreover, in general, we have $\mathcal{F}_n(t) \subset \mathcal{F}_t$ for each
$n$ and $t$, the inclusion usually being strict if the inter-agent communication graph is not complete.

⁷For a collection $\mathcal{J}$ of random objects, the notation $\sigma(\mathcal{J})$ denotes the smallest $\sigma$-algebra with respect to which all the random
objects in $\mathcal{J}$ are measurable.

In general, we are interested in applications with sparse inter-agent connectivity in which, even with agent
collaboration, the local information sets $\mathcal{F}_n(t)$ are strict subsets of the global $\mathcal{F}_t$ as explained above,
and the fundamental goal of this paper is to design distributed message exchange and local processing
policies that in the long run lead to sufficient network-wide information dissemination, such that each
agent eventually obtains an accurate estimate of the desired value function $V^*$. As will be seen, a
necessary condition for successful eventual information dissemination involves long-term connectivity of
the inter-agent communication graph. To this end, we assume that the time-varying stochastic inter-agent
communication graphs (generating the neighborhoods $\Omega_n(t)$ for each agent $n$ at every instant $t$) satisfy
the following weak connectivity condition:

(M.2): To account for possible random packet losses or infrastructure failures, as are commonly
encountered in wireless multi-agent communication settings, we assume that the agent network at time $t$
is modeled as an undirected graph, $G_t = (V, E_t)$, with the graph Laplacians being a sequence of
i.i.d. Laplacian matrices $\{L_t\}$. Specifically, we assume that $L_t$ is $\mathcal{F}_{t+1}$-adapted and is independent of
$\mathcal{F}_t$. We do not make any distributional assumptions on the link failure model. Although the link failures,
and so the Laplacians, are independent at different times, during the same iteration, the link failures
can be spatially dependent, i.e., correlated. This is more general than, and subsumes, the erasure network
model, where the link failures are independent over space and time. Wireless agent networks motivate
this model since interference among the wireless communication channels correlates the link failures over
space, while, over time, it is still reasonable to assume that the channels are memoryless or independent.
Finally, note that we do not require that the random instantiations $G_t$ of the graph be connected; in
fact, it is possible for all these instantiations to be disconnected. We only require that the graph
stays connected on average. Denoting $\mathbb{E}[L_t]$ by $\bar{L}$, this is captured by assuming $\lambda_2(\bar{L}) > 0$. This weak
connectivity requirement enables us to capture a broad class of asynchronous communication models; for
example, the random asynchronous gossip protocol analyzed in [44] satisfies $\lambda_2(\bar{L}) > 0$ and hence falls
under this framework. On the other hand, we assume that the inter-agent communication is noise-free
and unquantized in the event of an active communication link; the problem of quantized data exchange
in networked control systems (see, for example, [45], [46], [47], [48]) is an active research topic.

(M.3): At each $t$, the Laplacian $L_t$ is assumed to be independent of the instantaneous costs $c_n(\mathbf{x}_t, \mathbf{u}_t)$
conditioned on the state-action pair $(\mathbf{x}_t, \mathbf{u}_t)$.
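
The connectivity-on-average requirement in (M.2) can be illustrated with a short Python sketch. It relies on a standard fact: the mean Laplacian is a weighted Laplacian with nonnegative weights, so its second eigenvalue is positive exactly when the graph formed by the links with positive activation probability is connected. The ring topology, the one-link-per-slot activation model, the probabilities, and the function names below are hypothetical choices for illustration and are not the model analyzed in the paper.

```python
from collections import deque

def is_connected(num_nodes, edges):
    """Breadth-first search connectivity test for an undirected graph."""
    adj = {n: set() for n in range(num_nodes)}
    for n, l in edges:
        adj[n].add(l)
        adj[l].add(n)
    seen, queue = {0}, deque([0])
    while queue:
        n = queue.popleft()
        for l in adj[n] - seen:     # set difference is evaluated before the loop body runs
            seen.add(l)
            queue.append(l)
    return len(seen) == num_nodes

def mean_laplacian(num_nodes, link_states):
    """link_states: list of (probability, edge_list) pairs describing the i.i.d.
    distribution of the random graph. Returns the expected Laplacian."""
    L_bar = [[0.0] * num_nodes for _ in range(num_nodes)]
    for prob, edges in link_states:
        for n, l in edges:
            L_bar[n][l] -= prob
            L_bar[l][n] -= prob
            L_bar[n][n] += prob
            L_bar[l][l] += prob
    return L_bar

# Hypothetical gossip-like model on 4 agents: exactly one link of a ring is
# active in each slot, each with probability 1/4.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
link_states = [(0.25, [edge]) for edge in ring]
L_bar = mean_laplacian(4, link_states)

# Links that carry positive weight in the mean Laplacian.
support = [(n, l) for n in range(4) for l in range(n + 1, 4) if L_bar[n][l] < 0]

print([is_connected(4, edges) for _, edges in link_states])  # every instantiation
print(is_connected(4, support))                              # the mean graph
```

In this example every instantiation contains a single link and is therefore disconnected, yet the support of the mean Laplacian is the full ring, which is connected.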

We now consider QD-learning, in which network agents engage in mutual collaboration with a view
to learning the true value function $V^*$ eventually.

Before presenting the distributed update rule, for each pair $(i, u)$, let us introduce the sequence of
random times $\{T_{i,u}(k)\}$, such that $T_{i,u}(k)$ denotes the $(k + 1)$-th sampling instant of the state-action
pair $(i, u)$, i.e.,

$$T_{i,u}(k) = \inf\left\{ t \geq 0 \;\Big|\; \sum_{s=0}^{t} \mathbb{I}_{(\mathbf{x}_s, \mathbf{u}_s) = (i, u)} = k + 1 \right\}, \qquad (9)$$

for each $k \geq 0$, in which we adopt the convention that the infimum of an empty set is $\infty$. It can be
shown that the random time $T_{i,u}(k)$, for each $k$ and pair $(i, u)$, is a stopping time w.r.t. the filtration $\{\mathcal{F}_t\}$.
Further, note that, since we assume that the state-action pairs $(\mathbf{x}_t, \mathbf{u}_t)$ are also accessible to the agents
(see (7)), $T_{i,u}(k)$, for each $k$, qualifies as a stopping time w.r.t. the local filtrations $\{\mathcal{F}_n(t)\}$ as well. The
following requirement, which ensures that each state-action pair $(i, u)$ is observed (simulated) infinitely often, is
imposed:

(M.4): For each state-action pair $(i, u)$ and each $k \geq 0$, the stopping time $T_{i,u}(k)$ is a.s. finite, i.e.,

$$\mathbb{P}(T_{i,u}(k) < \infty) = 1.$$

It is to be noted that (M.4) is required in all forms of centralized Q-learning, whether real-time direct
adaptive control based or simulation based, for desired convergence with generic initial
conditions (approximations).
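
The sampling instants in (9) can be read off directly from a recorded trajectory, as the following Python sketch shows. The function name and the example trajectory are illustrative assumptions. On a finite recorded horizon, a returned value of infinity only means that the pair was not visited often enough within that horizon; it says nothing about whether (M.4) holds for the underlying infinite-horizon process.

```python
import math

def sampling_instant(trajectory, pair, k):
    """Return the time of the (k+1)-th visit of `pair` in `trajectory`
    (a list of (state, action) tuples), or math.inf if the pair is visited
    fewer than k+1 times on the recorded horizon."""
    count = 0
    for t, visited in enumerate(trajectory):
        if visited == pair:
            count += 1
            if count == k + 1:
                return t
    return math.inf

# Hypothetical recorded state-action trajectory.
trajectory = [(0, 1), (1, 0), (0, 1), (0, 0), (0, 1)]
print([sampling_instant(trajectory, (0, 1), k) for k in range(4)])
```

Because the decision of whether the $(k+1)$-th visit has occurred by time $t$ depends only on the state-action pairs up to time $t$, the same computation can be carried out locally by every agent that observes the state and the control.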

QD-learning: In QD-learning, each network agent $n$ maintains an $\mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$-valued sequence $\{Q^n_t\}$
(approximations of the so-called Q matrices) with components $Q^n_{i,u}(t)$ for every possible state-action pair
$(i,u)$. With this, the sequence $\{Q^n_{i,u}(t)\}$ at each agent $n$ for each pair $(i,u)$ evolves in a collaborative
distributed fashion as follows:

$$Q^n_{i,u}(t+1) = Q^n_{i,u}(t) - \beta_{i,u}(t) \sum_{l \in \Omega_n(t)} \big(Q^n_{i,u}(t) - Q^l_{i,u}(t)\big) + \alpha_{i,u}(t)\Big(c_n(x_t,u_t) + \gamma \min_{v \in \mathcal{U}} Q^n_{x_{t+1},v}(t) - Q^n_{i,u}(t)\Big), \tag{10}$$

where the weight sequences $\{\beta_{i,u}(t)\}$ and $\{\alpha_{i,u}(t)\}$ are $\{\mathcal{F}_n(t)\}$-adapted stochastic processes for each pair
$(i,u)$ and given by

$$\beta_{i,u}(t) = \begin{cases} \dfrac{b}{(k+1)^{\tau_2}} & \text{if } t = T_{i,u}(k) \text{ for some } k \ge 0, \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$

and

$$\alpha_{i,u}(t) = \begin{cases} \dfrac{a}{(k+1)^{\tau_1}} & \text{if } t = T_{i,u}(k) \text{ for some } k \ge 0, \\ 0 & \text{otherwise,} \end{cases} \tag{12}$$

$a$ and $b$ being positive constants. In other words, as reflected by the weight sequences (11)-(12), at
each agent $n$, the component $Q^n_{i,u}(t)$ is updated at an instant$^1$ $t$ iff the current state-action pair $(x_t,u_t)$
corresponds to $(i,u)$, and otherwise stays constant.
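The update (10)-(12) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `qd_step`, the array layout `Q[n, i, u]`, the visit counter `visits`, and the default constants are all hypothetical choices made for this example.

```python
import numpy as np


def qd_step(Q, visits, neighbors, costs, x, u, x_next, gamma,
            a=0.1, b=0.02, tau1=1.0, tau2=0.2):
    """One QD-learning update (10) applied at every agent.

    Q         : array (N, M, A); Q[n, i, u] is agent n's estimate for pair (i, u).
    visits    : int array (M, A); visits[i, u] = number k of past visits (mutated).
    neighbors : list of lists; neighbors[n] plays the role of Omega_n(t).
    costs     : array (N,); costs[n] is the one-stage cost c_n(x_t, u_t).
    """
    k = visits[x, u]
    alpha = a / (k + 1) ** tau1   # innovation weight (12)
    beta = b / (k + 1) ** tau2    # consensus weight (11)
    Q_new = Q.copy()              # all other components stay constant
    for n in range(Q.shape[0]):
        # Consensus potential: disagreement with current neighbors.
        consensus = sum(Q[n, x, u] - Q[l, x, u] for l in neighbors[n])
        # Innovation potential: local temporal-difference term.
        innovation = costs[n] + gamma * Q[n, x_next].min() - Q[n, x, u]
        Q_new[n, x, u] = Q[n, x, u] - beta * consensus + alpha * innovation
    visits[x, u] += 1
    return Q_new
```

Only the component of the currently visited pair `(x, u)` changes, which mirrors the fact that both weights vanish for every other pair at that instant.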

In addition to the processes $\{Q^n_t\}$, each agent $n$ maintains another $\{\mathcal{F}_n(t)\}$-adapted $\mathbb{R}^{|\mathcal{X}|}$-valued process
$\{V^n_t\}$ that serves as an approximation of the desired value function $V^*$. The $i$-th component of $V^n_t$,
$V^n_i(t)$, is successively refined as

$$V^n_i(t) = \min_{u \in \mathcal{U}} Q^n_{i,u}(t), \tag{13}$$

for $i = 1, \dots, M$.

The $\{\mathcal{F}_n(t)\}$-adaptability of the weight sequences, for each $n$, follows from the fact that the random
times $T_{i,u}(k)$, for all $k$, are stopping times w.r.t. $\{\mathcal{F}_n(t)\}$. With the identification that

$$m_{n,l}(t) = Q^l_t \quad \text{if } l \in \Omega_n(t),$$

where $m_{n,l}(t)$ denotes the message sent to agent $n$ by agent $l$ at time $t$, it is readily seen that, for
each $n$, the process $\{Q^n_t\}$ is well-defined and adapted to the local filtration $\{\mathcal{F}_n(t)\}$. We note that the
update rule in (10) is of the consensus + innovations form, in that it consists of the interplay between
an agreement or consensus potential reflecting agent collaboration, and a local innovation potential that
involves the incorporation of newly obtained intelligence through local sensing of the instantaneous
cost. The convergence of the resulting algorithm may only be achieved by intricately trading off these
potentials, which, in turn, imposes further restrictions on the algorithm weight sequences as follows:

(M.5): The constants $\tau_1$ and $\tau_2$ in (11)-(12) are assumed to satisfy $\tau_1 \in (1/2, 1]$ and
$0 < \tau_2 < \tau_1 - 1/(2+\varepsilon_1)$, with $\varepsilon_1$ being defined in (6). The above together with assumption (M.4) guarantee that
the excitations from the consensus and innovation potentials are persistent, i.e., the (stochastic) sequences
$\{\alpha_{i,u}(t)\}$ and $\{\beta_{i,u}(t)\}$ sum to $\infty$, for each state-action pair $(i,u)$. They further guarantee that the
innovation weight sequences are square summable, i.e., $\sum_{t \ge 0} \alpha^2_{i,u}(t) < \infty$ a.s., and that the consensus
potential dominates the innovation potential eventually, i.e., $\beta_{i,u}(t)/\alpha_{i,u}(t) \to \infty$ a.s. as $t \to \infty$ for each
pair $(i,u)$.
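As a quick numerical illustration of (M.5), the following minimal Python sketch checks whether a pair of exponents is admissible and evaluates the weights (11)-(12) at the $(k+1)$-th visit of a pair. The function names `exponents_admissible` and `weights`, as well as the default values of `a` and `b`, are hypothetical and chosen only for this illustration.

```python
def exponents_admissible(tau1, tau2, eps1):
    """Check (M.5): tau1 in (1/2, 1] and 0 < tau2 < tau1 - 1/(2 + eps1)."""
    return 0.5 < tau1 <= 1.0 and 0.0 < tau2 < tau1 - 1.0 / (2.0 + eps1)


def weights(k, a=0.1, b=0.02, tau1=1.0, tau2=0.2):
    """Innovation weight alpha and consensus weight beta at the (k+1)-th visit."""
    alpha = a / (k + 1) ** tau1
    beta = b / (k + 1) ** tau2
    return alpha, beta
```

Along the visit index $k$, the ratio `beta / alpha` equals $(b/a)(k+1)^{\tau_1 - \tau_2}$ and therefore grows without bound whenever $\tau_1 > \tau_2$, which is the eventual domination of the consensus potential described above.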

Remark 3.1 We comment on the choice of the weight sequences $\{\beta_{i,u}(t)\}$ and $\{\alpha_{i,u}(t)\}$ associated with
the consensus and innovation potentials respectively. From (M.5) (and (M.4)) we note that both the
excitations for agent-collaboration (consensus) and local innovation are persistent, i.e., the sequences
$\{\beta_{i,u}(t)\}$ and $\{\alpha_{i,u}(t)\}$ sum to $\infty$, a standard requirement in stochastic approximation type algorithms
to drive the updates to the desired limit from arbitrary initial conditions. Further, the square summability
of $\{\alpha_{i,u}(t)\}$ ($\tau_1 > 1/2$) is required to mitigate the effect of stochasticity (due to the randomness involved
in the one-stage costs and the state transitions) in the innovations. The requirement $\beta_{i,u}(t)/\alpha_{i,u}(t) \to \infty$ as
$t \to \infty$ ($\tau_1 > \tau_2$), i.e., the asymptotic domination of the consensus potential over the local innovations,
ensures the right information mixing, thus, as shown below, leading to optimal convergence. Technically,
the different asymptotic decay rates of the two potentials lead to mixed time-scale stochastic recursions
whose analyses require new techniques in stochastic approximation as developed in the paper.

$^1$Note that the phrase updated at an instant $t$ refers to the possible transition of $Q^n_{i,u}(t)$ to $Q^n_{i,u}(t+1)$, an event which
actually occurs after the one-stage cost $c_n(x_t,u_t)$ has been incurred and the successor state $x_{t+1}$ has been reached. In terms
of implementation, such an update may be realized at the end of the time slot $t$.

We further comment on the constants $a$ and $b$ in (11)-(12) affecting the weight sequences. While the
main results and the proof arguments in this paper will continue to hold for arbitrary positive constants
$a$ and $b$, to simplify the exposition that follows we further assume that the constants are small enough,
such that, for each time instant $t$ and state-action pair $(i,u)$, the matrix $(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)$ is
non-negative definite. Noting that the largest eigenvalue of the Laplacian $L_t$, at an instant $t$, is upper-bounded
by $N$, the number of network agents, the above condition is ensured by requiring $a$ and $b$ to
satisfy $a + Nb \le 1$. We emphasize that the above requirement on $a$ and $b$ is by no means necessary,
but greatly reduces the analytical overhead. In fact, for arbitrary positive $a$ and $b$, (M.4)-(M.5) imply
that, for each state-action pair $(i,u)$, there exists $t_0(i,u) > 0$ (possibly random), such that the matrix
$(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)$ is non-negative definite for $t \ge t_0(i,u)$.
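The sufficiency of $a + Nb \le 1$ can be checked numerically. The sketch below is a hedged illustration only: it assumes a simple ring graph (the helper `ring_laplacian` is hypothetical and is not the topology used later in the simulations) and evaluates the smallest eigenvalue of $I_N - \beta L - \alpha I_N$ at the largest possible weights $\alpha = a$, $\beta = b$.

```python
import numpy as np


def ring_laplacian(N):
    """Laplacian of an undirected ring on N nodes (illustrative topology)."""
    A = np.zeros((N, N))
    for n in range(N):
        A[n, (n + 1) % N] = 1.0
        A[(n + 1) % N, n] = 1.0
    return np.diag(A.sum(axis=1)) - A


def min_eig_update_matrix(L, alpha, beta):
    """Smallest eigenvalue of the symmetric matrix I - beta*L - alpha*I."""
    N = L.shape[0]
    W = np.eye(N) - beta * L - alpha * np.eye(N)
    return np.linalg.eigvalsh(W).min()
```

Since the eigenvalues of the update matrix are $1 - \alpha - \beta\lambda$ with $\lambda$ ranging over the Laplacian spectrum, bounding $\lambda \le N$, $\alpha \le a$ and $\beta \le b$ gives the stated condition.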

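The rest of the paper is devoted to the convergence analysis of the proposed QD-learning, in which
our goal is to show that, for each $n$, $V^n_t \to V^*$ a.s. as $t \to \infty$, so that eventually each agent obtains
the accurate value function and the corresponding optimal stationary strategy (through (13)). To this
end, for each $n$ define the local QD-learning operator $\mathcal{G}^n : \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|} \mapsto \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$, whose components
$\mathcal{G}^n_{i,u}$ are given by

$$\mathcal{G}^n_{i,u}(Q) = \mathbb{E}[c_n(i,u)] + \gamma \sum_{j \in \mathcal{X}} p^u_{i,j} \min_{v \in \mathcal{U}} Q_{j,v}, \tag{14}$$

for all $Q = \{Q_{i,u}\} \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$. Noting that under (M.1), on $\{x_t = i, u_t = u\}$,

$$\mathbb{E}[c_n(x_t,u_t) \mid \mathcal{F}_t] = \mathbb{E}[c_n(i,u)],$$

and

$$\mathbb{E}\Big[\min_{v \in \mathcal{U}} Q^n_{x_{t+1},v}(t) \,\Big|\, \mathcal{F}_t\Big] = \sum_{j \in \mathcal{X}} p^u_{i,j} \min_{v \in \mathcal{U}} Q^n_{j,v}(t),$$

the recursive update in (10), for each state-action pair $(i,u)$, may be rewritten as

$$Q^n_{i,u}(t+1) = Q^n_{i,u}(t) - \beta_{i,u}(t) \sum_{l \in \Omega_n(t)} \big(Q^n_{i,u}(t) - Q^l_{i,u}(t)\big) + \alpha_{i,u}(t)\Big(\mathcal{G}^n_{i,u}(Q^n_t) - Q^n_{i,u}(t) + \nu^n_{i,u}(t)\Big), \tag{15}$$

in which the residual

$$\nu^n_{i,u}(t) = c_n(x_t,u_t) + \gamma \min_{v \in \mathcal{U}} Q^n_{x_{t+1},v}(t) - \mathcal{G}^n_{i,u}(Q^n_t) \tag{16}$$

plays the role of a martingale difference noise, i.e., $\mathbb{E}[\nu^n_{x_t,u_t}(t) \mid \mathcal{F}_t] = 0$ for all $t$.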
A. Main Result 

The main result of the paper concerning the convergence of the proposed QD-learning is stated as
follows (proof provided in Section 5-C).

Theorem 3.1 Let $\{Q^n_t\}$ and $\{V^n_t\}$ be the successive iterates obtained at agent $n$ through the QD-learning
(see (10) and (13)). Then, under (M.1)-(M.5), there exists $Q^* \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$, such that

$$\mathbb{P}\Big(\lim_{t \to \infty} Q^n_t = Q^*\Big) = 1$$

for each network agent $n$. Further, for each $i \in \mathcal{X}$, we have

$$\min_{u \in \mathcal{U}} Q^*_{i,u} = V^*_i,$$

and, hence, in particular, $V^n_t \to V^*$ as $t \to \infty$ a.s. for each $n$, where $V^*$ denotes the desired value
function (1).

4. Intermediate Approximation Results 

This section provides some approximation results to be used in the sequel for the analysis of QD-learning.
In what follows, $\{z_t\}$ will denote a stochastic process that is adapted to a generic filtration
$\{\mathcal{H}_t\}$ (possibly different from $\{\mathcal{F}_t\}$) defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

The following result from [26] will be used.

Lemma 4.1 (Lemma 4.3 in [26]) Let $\{z_t\}$ be an $\mathbb{R}_+$ valued $\{\mathcal{H}_t\}$ adapted process that satisfies

$$z_{t+1} \le (1 - r_1(t))\, z_t + r_2(t)\, U_t\, (1 + J_t).$$

In the above, $\{r_1(t)\}$ is an $\{\mathcal{H}_{t+1}\}$ adapted process, such that, for all $t$, $r_1(t)$ satisfies $0 \le r_1(t) \le 1$
and

$$\frac{a_1}{(t+1)^{\delta_1}} \le \mathbb{E}[r_1(t) \mid \mathcal{H}_t] \le 1$$

with $a_1 > 0$ and $0 \le \delta_1 < 1$, whereas the sequence $\{r_2(t)\}$ is deterministic, $\mathbb{R}_+$ valued, and satisfies
$r_2(t) \le a_2/(t+1)^{\delta_2}$ with $a_2 > 0$ and $\delta_2 > 0$. Further, let $\{U_t\}$ and $\{J_t\}$ be $\mathbb{R}_+$ valued $\{\mathcal{H}_t\}$ and $\{\mathcal{H}_{t+1}\}$
adapted processes respectively with $\sup_{t \ge 0} \|U_t\| < \infty$ a.s., and $\{J_t\}$ is i.i.d. with $J_t$ independent of $\mathcal{H}_t$
for each $t$ and satisfies the moment condition $\mathbb{E}\big[\|J_t\|^{2+\varepsilon_1}\big] < \kappa < \infty$ for some $\varepsilon_1 > 0$ and a constant
$\kappa > 0$. Then, for every $\delta_0$ such that

$$0 \le \delta_0 < \delta_2 - \delta_1 - \frac{1}{2+\varepsilon_1},$$

we have $(t+1)^{\delta_0} z_t \to 0$ a.s. as $t \to \infty$.



The following result from [26], which provides a stochastic characterization of the contraction properties
of random time-varying graph Laplacian matrices, will be used to quantify the rate of convergence of
distributed vector or matrix valued recursions to their network-averaged behavior.

Definition 4.1 For positive integers $N$ and $P$, denote by $\mathcal{C}$ the consensus subspace of $\mathbb{R}^{NP}$, i.e.,

$$\mathcal{C} = \big\{ \mathbf{y} \in \mathbb{R}^{NP} : \mathbf{y} = \mathbf{1}_N \otimes \mathbf{y}' \text{ for some } \mathbf{y}' \in \mathbb{R}^P \big\}.$$

Let $\mathcal{C}^\perp$ be the orthogonal complement of $\mathcal{C}$ and note that any $\mathbf{y} \in \mathbb{R}^{NP}$ admits the orthogonal
decomposition $\mathbf{y} = \mathbf{y}_{\mathcal{C}} + \mathbf{y}_{\mathcal{C}^\perp}$, with $\mathbf{y}_{\mathcal{C}}$ denoting the consensus subspace projection of $\mathbf{y}$.
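The decomposition in Definition 4.1 is straightforward to compute: the consensus component replicates the average of the $N$ blocks of size $P$. The following Python sketch is illustrative only; the function name `consensus_split` and the block layout (agent $n$ occupies entries $nP, \dots, (n+1)P - 1$ of the stacked vector) are assumptions made for this example.

```python
import numpy as np


def consensus_split(y, N, P):
    """Split y in R^{NP} into its consensus part and the orthogonal residual.

    The stacked vector is viewed as N consecutive blocks of length P.
    """
    blocks = y.reshape(N, P)
    mean_block = blocks.mean(axis=0)   # y' in R^P
    y_c = np.tile(mean_block, N)       # 1_N (Kronecker product) y'
    y_perp = y - y_c                   # component in the orthogonal complement
    return y_c, y_perp
```

For $P = 1$ this reduces to subtracting the network average from each entry, which is the form used later in the consensus analysis.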

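Lemma 4.2 (Lemma 4.4 in [26]) Let $\{\mathbf{z}_t\}$ be an $\mathbb{R}^{NP}$ valued $\{\mathcal{H}_t\}$ adapted process such that $\mathbf{z}_t \in \mathcal{C}^\perp$
(see Definition 4.1) for all $t$. Also, let $\{L_t\}$ be an i.i.d. sequence of graph Laplacian matrices that satisfies

$$\lambda_2(\bar{L}) = \lambda_2(\mathbb{E}[L_t]) > 0,$$

with $L_t$ being $\mathcal{H}_{t+1}$ adapted and independent of $\mathcal{H}_t$ for all $t$. Then, there exists a measurable $\{\mathcal{H}_{t+1}\}$
adapted $\mathbb{R}_+$ valued process $\{r_t\}$ (depending on $\{\mathbf{z}_t\}$ and $\{L_t\}$) and a constant $c_r > 0$, such that
$0 \le r_t \le 1$ a.s. and

$$\big\|(I_{NP} - \bar{r}_t L_t \otimes I_P)\,\mathbf{z}_t\big\| \le (1 - r_t)\,\|\mathbf{z}_t\|$$

with

$$\mathbb{E}[r_t \mid \mathcal{H}_t] \ge \frac{c_r}{(t+1)^{\delta}} \quad \text{a.s.}$$

for all $t$, where the weight sequence $\{\bar{r}_t\}$ satisfies $\bar{r}_t \le \bar{r}/(t+1)^{\delta}$ for some $\bar{r} > 0$ and $\delta \in (0,1]$.

For a discussion of the necessary technicalities involved in the construction of the sequence $\{r_t\}$, the
reader is referred to [26] (Remark 4.1).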



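Lemma 4.3 For each state-action pair $(i,u)$, let $\{\mathbf{z}_{i,u}(t)\}$ denote the $\{\mathcal{F}_t\}$ adapted process evolving as

$$\mathbf{z}_{i,u}(t+1) = (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{z}_{i,u}(t) + \alpha_{i,u}(t)\,\boldsymbol{\nu}_{i,u}(t), \tag{17}$$

where the weight sequences $\{\beta_{i,u}(t)\}$ and $\{\alpha_{i,u}(t)\}$ are given by (11)-(12) and $\{\boldsymbol{\nu}_{i,u}(t)\}$ is an $\{\mathcal{F}_{t+1}\}$
adapted process satisfying $\mathbb{E}[\boldsymbol{\nu}_{i,u}(t) \mid \mathcal{F}_t] = 0$ for all $t$ and

$$\sup_{t \ge 0} \mathbb{E}\big[\|\boldsymbol{\nu}_{i,u}(t)\|^2 \mid \mathcal{F}_t\big] < K < \infty,$$

$K$ being a constant. Then, under (M.4)-(M.5), we have $\mathbf{z}_{i,u}(t) \to 0$ as $t \to \infty$ a.s.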



The following result will be used in the proof of Lemma 4.3.

Proposition 4.1 Let $\{z_t\}$ be a real-valued deterministic process, such that

$$z_{t+1} \le (1 - a_t)\, z_t + a_t \varepsilon_t,$$

where the deterministic sequences $\{a_t\}$ and $\{\varepsilon_t\}$ satisfy $a_t \in [0,1]$ for all $t$, $\sum_{t \ge 0} a_t = \infty$, and there
exists a constant $R > 0$, such that

$$\limsup_{t \to \infty} \varepsilon_t \le R.$$

Then, $\limsup_{t \to \infty} z_t \le R$.

Variants of the above result may be found in the literature. We provide a simple self-contained proof
in the following.

Proof: Consider $\delta > 0$ and note that, by hypothesis, there exists $t_\delta > 0$, such that $\varepsilon_t < (R + \delta)$ for
all $t \ge t_\delta$. Hence, for $t \ge t_\delta$, we have

$$z_{t+1} \le (1 - a_t)\, z_t + a_t (R + \delta).$$

Hence, denoting by $\{\bar{z}_t\}$ the sequence with $\bar{z}_t = z_t - (R + \delta)$ for all $t$, we have, for $t \ge t_\delta$,

$$\bar{z}_{t+1} \le (1 - a_t)\, \bar{z}_t. \tag{18}$$

Since $\sum_{t \ge t_\delta} a_t = \infty$, we conclude that

$$\limsup_{t \to \infty} \prod_{s = t_\delta}^{t-1} (1 - a_s) \le \limsup_{t \to \infty} e^{-\sum_{s = t_\delta}^{t-1} a_s} = 0,$$

and hence, by (18), $\limsup_{t \to \infty} \bar{z}_t \le 0$. We thus obtain

$$\limsup_{t \to \infty} z_t \le R + \delta,$$

from which the desired assertion follows by taking $\delta$ to zero.
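Proposition 4.1 can be illustrated numerically by running the recursion with equality. The sketch below is a minimal example under assumed sequences: the function name `run_recursion` and the particular choices $a_t = 1/(t+1)$ and $\varepsilon_t = R + 1/(t+1)$ are hypothetical and serve only to show the iterate settling near $R$ regardless of a large initial value.

```python
def run_recursion(z0, a, eps):
    """Iterate z_{t+1} = (1 - a_t) z_t + a_t eps_t and return the final value."""
    z = z0
    for a_t, e_t in zip(a, eps):
        z = (1.0 - a_t) * z + a_t * e_t
    return z


R = 2.0
T = 20000
a = [1.0 / (t + 1) for t in range(T)]        # sums to infinity, values in [0, 1]
eps = [R + 1.0 / (t + 1) for t in range(T)]  # limsup equals R
z_final = run_recursion(100.0, a, eps)
```

With these weights the iterate is the running average of the $\varepsilon_t$, so `z_final` exceeds $R$ only by a term of order $(\log T)/T$.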



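We now complete the proof of Lemma 4.3.

Proof of Lemma 4.3: Recall the consensus subspace $\mathcal{C}$ of $\mathbb{R}^N$ (see Definition 4.1 with $P = 1$). By properties
of the Laplacian, we obtain the following inequalities for each $\mathbf{y} \in \mathbb{R}^N$:

$$\lambda_2(\bar{L})\, \|\mathbf{y}_{\mathcal{C}^\perp}\|^2 \le \mathbf{y}^T \bar{L}\, \mathbf{y} \le \lambda_N(\bar{L})\, \|\mathbf{y}_{\mathcal{C}^\perp}\|^2, \tag{19}$$

and

$$\mathbf{y}^T \mathbb{E}[L_t^2]\, \mathbf{y} \le c_1 \|\mathbf{y}_{\mathcal{C}^\perp}\|^2 \tag{20}$$

for each $t$, where $c_1 > 0$ is a constant. Now consider the $\{\mathcal{F}_t\}$ adapted process $\{V_t\}$, such that $V_t = \|\mathbf{z}_{i,u}(t)\|^2$
for each $t$, and note that under the hypotheses of Lemma 4.3 we have,

$$\begin{aligned}
\mathbb{E}[V_{t+1} \mid \mathcal{F}_t] &= V_t - 2\alpha_{i,u}(t)V_t - 2\beta_{i,u}(t)\,\mathbf{z}_{i,u}^T(t)\bar{L}\,\mathbf{z}_{i,u}(t) + \alpha^2_{i,u}(t)V_t + 2\alpha_{i,u}(t)\beta_{i,u}(t)\,\mathbf{z}_{i,u}^T(t)\bar{L}\,\mathbf{z}_{i,u}(t) \\
&\quad + \beta^2_{i,u}(t)\,\mathbf{z}_{i,u}^T(t)\mathbb{E}[L_t^2]\,\mathbf{z}_{i,u}(t) + \alpha^2_{i,u}(t)\,\mathbb{E}\big[\|\boldsymbol{\nu}_{i,u}(t)\|^2 \mid \mathcal{F}_t\big] \\
&\le \big(1 - 2\alpha_{i,u}(t) + \alpha^2_{i,u}(t)\big)V_t \\
&\quad - \big(2\beta_{i,u}(t)\lambda_2(\bar{L}) - \beta^2_{i,u}(t)c_1 - 2\alpha_{i,u}(t)\beta_{i,u}(t)\lambda_N(\bar{L})\big)\, \|(\mathbf{z}_{i,u}(t))_{\mathcal{C}^\perp}\|^2 + \alpha^2_{i,u}(t)c_2,
\end{aligned} \tag{21}$$

where $c_2 > 0$ is a constant and in the last step we make use of (19)-(20).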

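Recall the stopping times $\{T_{i,u}(k)\}$ and note that, by (11)-(12), there exists a positive integer $k_0$ and
a constant $c_3 > 0$, such that $t \ge T_{i,u}(k_0)$ implies a.s.

$$0 \le \big(1 - 2\alpha_{i,u}(t) + \alpha^2_{i,u}(t)\big) \le \big(1 - c_3\alpha_{i,u}(t)\big),$$

and

$$2\beta_{i,u}(t)\lambda_2(\bar{L}) - \beta^2_{i,u}(t)c_1 - 2\alpha_{i,u}(t)\beta_{i,u}(t)\lambda_N(\bar{L}) \ge 0. \tag{22}$$

By (M.4), the stopping time $T_{i,u}(k_0)$ is finite a.s., and hence, for every $\varepsilon > 0$, there exists $t_\varepsilon > 0$
(deterministic), such that

$$\mathbb{P}\big(T_{i,u}(k_0) > t_\varepsilon\big) < \varepsilon. \tag{23}$$

Now, for a given $\varepsilon > 0$, construct the process $\{V^\varepsilon_t\}$ as follows:

$$V^\varepsilon_t = \mathbb{I}\big(T_{i,u}(k_0) \le t_\varepsilon\big)\, V_t, \quad \forall t. \tag{24}$$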
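Since $\{T_{i,u}(k_0) \le t_\varepsilon\} \in \mathcal{F}_{t_\varepsilon}$, we note that $V^\varepsilon_t$ is adapted to $\mathcal{F}_t$ for all $t \ge t_\varepsilon$. Also, by (21)-(22), for
$t \ge t_\varepsilon$, we have

$$\begin{aligned}
\mathbb{E}[V^\varepsilon_{t+1} \mid \mathcal{F}_t] &= \mathbb{I}\big(T_{i,u}(k_0) \le t_\varepsilon\big)\, \mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \\
&\le \mathbb{I}\big(T_{i,u}(k_0) \le t_\varepsilon\big) \Big[\big(1 - 2\alpha_{i,u}(t) + \alpha^2_{i,u}(t)\big)V_t \\
&\qquad - \big(2\beta_{i,u}(t)\lambda_2(\bar{L}) - \beta^2_{i,u}(t)c_1 - 2\alpha_{i,u}(t)\beta_{i,u}(t)\lambda_N(\bar{L})\big)\, \|(\mathbf{z}_{i,u}(t))_{\mathcal{C}^\perp}\|^2 + \alpha^2_{i,u}(t)c_2\Big] \\
&\le \mathbb{I}\big(T_{i,u}(k_0) \le t_\varepsilon\big) \big[\big(1 - c_3\alpha_{i,u}(t)\big)V_t\big] + \alpha^2_{i,u}(t)c_2 \le \big(1 - c_3\alpha_{i,u}(t)\big)V^\varepsilon_t + \alpha^2_{i,u}(t)c_2.
\end{aligned}$$

With the above, the pathwise instantiations of the process $\{V^\varepsilon_t\}$ clearly fall under the purview of
Proposition 4.1 and we conclude that

$$\mathbb{P}\Big(\lim_{t \to \infty} V^\varepsilon_t = 0\Big) = 1.$$

This, together with (24), implies that the process $\{V_t\}$ converges to zero on the event $\{T_{i,u}(k_0) \le t_\varepsilon\}$,
and, hence, by (23) we obtain

$$\mathbb{P}\Big(\lim_{t \to \infty} V_t = 0\Big) > 1 - \varepsilon.$$

Since $\varepsilon > 0$ is arbitrary, the desired result follows by taking $\varepsilon$ to zero.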



Remark 4.1 Note that, although the statement of Lemma 4.3 assumes (M.4)-(M.5) to hold, the only
condition on the sequence $\{\beta_{i,u}(t)\}$ that we actually use in the proof involves the requirement that (22)
holds eventually. Given that (22) holds trivially for all $t$ if $\beta_{i,u}(t) = 0$ for all $t$, we note that the assertions
of Lemma 4.3 continue to hold if $\{\beta_{i,u}(t)\}$ is set to zero identically (i.e., the Laplacian dependent dynamics
is dropped) in the update process (17).

Corollary 4.1 For each state-action pair $(i,u)$ and $t_0 \ge 0$, consider the process $\{\mathbf{z}_{i,u}(t : t_0)\}_{t \ge t_0}$ that
evolves as

$$\mathbf{z}_{i,u}(t+1 : t_0) = (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{z}_{i,u}(t : t_0) + \alpha_{i,u}(t)\,\boldsymbol{\nu}_{i,u}(t)$$

with $\mathbf{z}_{i,u}(t_0 : t_0) = 0$, where the processes $\{\beta_{i,u}(t)\}$, $\{\alpha_{i,u}(t)\}$, and $\{\boldsymbol{\nu}_{i,u}(t)\}$ satisfy the hypotheses of
Lemma 4.3. Then, for each $\varepsilon > 0$, there exists a random time $t_\varepsilon$, such that $\|\mathbf{z}_{i,u}(t : t_0)\| \le \varepsilon$ for all
$t_\varepsilon \le t_0 \le t$.

Proof: Note that, for each $t \ge t_0$,

$$\|\mathbf{z}_{i,u}(t : t_0)\| = \Big\|\mathbf{z}_{i,u}(t : 0) - \Big(\prod_{s = t_0}^{t-1} (I_N - \beta_{i,u}(s)L_s - \alpha_{i,u}(s)I_N)\Big)\mathbf{z}_{i,u}(t_0 : 0)\Big\| \le \|\mathbf{z}_{i,u}(t : 0)\| + \|\mathbf{z}_{i,u}(t_0 : 0)\|, \tag{25}$$

where, to obtain the last inequality, we use the fact that, under (M.5) (see also Remark 3.1),

$$\|I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N\| \le 1, \quad \forall t \ge 0.$$

By Lemma 4.3, $\mathbf{z}_{i,u}(t : 0) \to 0$ as $t \to \infty$ a.s., and, hence, there exists $t_\varepsilon$, such that

$$\|\mathbf{z}_{i,u}(t : 0)\| \le \varepsilon/2, \quad \forall t \ge t_\varepsilon. \tag{26}$$

The result follows immediately from (25)-(26). ■
The following order-preserving property is readily verifiable.

Proposition 4.2 Under (M.4)-(M.5), for each $t \ge 0$, the linear operator $(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)$
is order-preserving on $\mathbb{R}^N$, i.e., for all $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{R}^N$ with $\mathbf{x} \le_c \mathbf{y}$, we have

$$(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{x} \le_c (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{y}.$$

Proof: For the matrix $(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)$, note that, under (M.5) (see also Remark 3.1), the
diagonal elements are non-negative. The off-diagonal elements, being negatively scaled versions of those
of the Laplacian $L_t$, are also non-negative (by definition). Hence, the matrix $(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)$
is non-negative, and $(\mathbf{x} - \mathbf{y}) \le_c 0$ implies

$$(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,(\mathbf{x} - \mathbf{y}) \le_c 0,$$

from which the desired property follows.
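The argument above amounts to checking that the update matrix is entrywise non-negative. A minimal Python sketch of this check follows; the helper names `update_matrix` and `is_order_preserving` are hypothetical, and the three-node path graph used in the usage note is an assumed toy example rather than a topology from the paper.

```python
import numpy as np


def update_matrix(L, alpha, beta):
    """Return I - beta*L - alpha*I for a graph Laplacian L."""
    N = L.shape[0]
    return (1.0 - alpha) * np.eye(N) - beta * L


def is_order_preserving(L, alpha, beta):
    """Entrywise non-negativity is sufficient for x <= y to imply Wx <= Wy."""
    return bool((update_matrix(L, alpha, beta) >= 0).all())


# Toy example: path graph on three nodes.
L_path = np.array([[1.0, -1.0, 0.0],
                   [-1.0, 2.0, -1.0],
                   [0.0, -1.0, 1.0]])
```

With `alpha = 0.3` and `beta = 0.2` all entries are non-negative, whereas a large `alpha` combined with the same `beta` makes a diagonal entry negative and the sufficient condition fails.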



5. Convergence of QD-learning 



The current section focuses on the convergence analysis of QD-learning. Section 5-A obtains the
boundedness of QD-learning, whereas Section 5-B establishes consensus of the agent updates to the
networked average behavior. Finally, Section 5-C completes the proof of Theorem 3.1 by studying the
properties of the resulting averaged network dynamics.



A. QD-learning: Boundedness 

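This section is devoted to obtaining the following boundedness of the QD-learning iterates:

Lemma 5.1 For each agent $n$, the successive refinement sequence $\{Q^n_t\}$ is pathwise bounded, i.e.,

$$\mathbb{P}\Big(\sup_{t \ge 0} \|Q^n_t\|_\infty < \infty\Big) = 1.$$

Proof: The proof is inspired by a corresponding development in the literature for the single-agent (centralized)
case. Recall the local QD-learning operator $\mathcal{G}^n(\cdot)$ defined in (14). By (15), for each $n$ and state-action
pair $(i,u)$,

$$Q^n_{i,u}(t+1) = Q^n_{i,u}(t) - \beta_{i,u}(t) \sum_{l \in \Omega_n(t)} \big(Q^n_{i,u}(t) - Q^l_{i,u}(t)\big) + \alpha_{i,u}(t)\Big(\mathcal{G}^n_{i,u}(Q^n_t) - Q^n_{i,u}(t) + \nu^n_{i,u}(t)\Big).$$

Denoting by $\{\mathbf{Q}_{i,u}(t)\}$ the $\{\mathcal{F}_t\}$ adapted process with $\mathbf{Q}_{i,u}(t) = [Q^1_{i,u}(t), \dots, Q^N_{i,u}(t)]^T$ for all $t$, we note that

$$\mathbf{Q}_{i,u}(t+1) = (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{Q}_{i,u}(t) + \alpha_{i,u}(t)\big(\boldsymbol{\mathcal{G}}_{i,u}(\mathbf{Q}_t) + \boldsymbol{\nu}_{i,u}(t)\big), \tag{27}$$

where $\boldsymbol{\mathcal{G}}_{i,u}(\mathbf{Q}_t) = [\mathcal{G}^1_{i,u}(Q^1_t), \dots, \mathcal{G}^N_{i,u}(Q^N_t)]^T$ and $\boldsymbol{\nu}_{i,u}(t)$ is defined as $[\nu^1_{i,u}(t), \dots, \nu^N_{i,u}(t)]^T$ (see (16)) on
the event $\{x_t = i, u_t = u\}$, and is taken to be zero elsewhere. By (14)-(16), it follows that $\mathbb{E}[\boldsymbol{\nu}_{i,u}(t) \mid \mathcal{F}_t] = 0$
for all $t$, and there exist positive constants $c_1$ and $c_2$, such that

$$\mathbb{E}\big[\|\boldsymbol{\nu}_{i,u}(t)\|^2 \mid \mathcal{F}_t\big] \le c_1 + c_2 \|\mathbf{Q}_t\|^2_\infty, \tag{28}$$

with $\mathbf{Q}_t$ denoting the $\mathbb{R}^{N|\mathcal{X}\times\mathcal{U}|}$-valued vector collecting the $Q^n_t$'s for $n = 1, \dots, N$. Finally, note that,
for each $n$ and state-action pair $(i,u)$,

$$|\mathcal{G}^n_{i,u}(Q)| \le c_3 + \gamma \|Q\|_\infty$$

for all $Q \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$, where $c_3 > 0$ is a constant. Thus, there exist $\bar{\gamma} \in [0,1)$ (necessarily larger than the
discount factor $\gamma$) and a constant $J > 0$, such that

$$|\mathcal{G}^n_{i,u}(Q)| \le \bar{\gamma} \max\big(\|Q\|_\infty, J\big) \tag{29}$$

for all $Q \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$. Also, let $\varepsilon$ be such that $\bar{\gamma}(1 + \varepsilon) = 1$.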
Now consider the $\{\mathcal{F}_t\}$ adapted process $\{M_t\}$, given by

$$M_t = \max_{s \le t} \|\mathbf{Q}_s\|_\infty, \quad \forall t.$$

Let $\{J_t\}$ be another $\{\mathcal{F}_t\}$ adapted process with $J_0 = J$, and for each $t > 0$, $J_t = J_{t-1}$ on the event
$\{M_t \le (1+\varepsilon)J_{t-1}\}$; otherwise, i.e., if $M_t > (1+\varepsilon)J_{t-1}$, $J_t$ is defined by $J_t = J(1+\varepsilon)^k$, where $k > 0$
is chosen to satisfy

$$J(1+\varepsilon)^{k-1} < M_t \le J(1+\varepsilon)^k.$$

The following hold by the above construction:

$$M_t \le (1+\varepsilon) J_t, \quad \forall t \ge 0, \tag{30}$$

$$M_t \le J_t \quad \text{if } J_{t-1} < J_t. \tag{31}$$
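The construction of the envelope process $\{J_t\}$ is algorithmic and may be sketched as follows. This is only an illustration: the function name `envelope` is hypothetical, the input `M` is assumed to be a non-decreasing (running-maximum) sequence, and the small correction loops merely guard against floating-point rounding in the logarithm.

```python
import math


def envelope(M, J, eps):
    """Return [J_0, ..., J_T] built from a running-max sequence M = [M_0, ..., M_T]."""
    Js = [J]  # J_0 = J
    for t in range(1, len(M)):
        prev = Js[-1]
        if M[t] <= (1.0 + eps) * prev:
            Js.append(prev)  # no change
        else:
            # Choose k with J(1+eps)^(k-1) < M_t <= J(1+eps)^k.
            k = math.ceil(math.log(M[t] / J) / math.log(1.0 + eps))
            while J * (1.0 + eps) ** k < M[t]:
                k += 1
            while k > 1 and J * (1.0 + eps) ** (k - 1) >= M[t]:
                k -= 1
            Js.append(J * (1.0 + eps) ** k)
    return Js


M_demo = [0.5, 1.5, 5.0, 5.0, 9.0, 40.0]
J_demo = envelope(M_demo, J=1.0, eps=1.0)
```

In this toy run the envelope jumps exactly at the instants where $M_t$ exceeds $(1+\varepsilon)J_{t-1}$, and at those instants $M_t \le J_t$, in line with (30)-(31).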



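Assume, on the contrary, that $\{\mathbf{Q}_t\}$ is not bounded a.s. Then, there exists an event $B$ of positive measure,
such that $M_t \to \infty$ as $t \to \infty$ on $B$.

To set up a contradiction argument, consider, for each state-action pair $(i,u)$ and $t_0 \ge 0$, the process
$\{\hat{\mathbf{z}}_{i,u}(t : t_0)\}_{t \ge t_0}$ that evolves as

$$\hat{\mathbf{z}}_{i,u}(t+1 : t_0) = (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\hat{\mathbf{z}}_{i,u}(t : t_0) + \alpha_{i,u}(t)\,\hat{\boldsymbol{\nu}}_{i,u}(t), \tag{32}$$

in which $\hat{\mathbf{z}}_{i,u}(t_0 : t_0) = 0$ and $\hat{\boldsymbol{\nu}}_{i,u}(t)$ is a scaled version of $\boldsymbol{\nu}_{i,u}(t)$ (see (27)-(28)), such that
$\hat{\boldsymbol{\nu}}_{i,u}(t) = \boldsymbol{\nu}_{i,u}(t)/J_t$. Note that $\mathbb{E}[\hat{\boldsymbol{\nu}}_{i,u}(t) \mid \mathcal{F}_t] = 0$, which follows from (28) and the fact that $J_t$ is adapted to $\mathcal{F}_t$,
and

$$\mathbb{E}\big[\|\hat{\boldsymbol{\nu}}_{i,u}(t)\|^2 \mid \mathcal{F}_t\big] = \frac{1}{J_t^2}\,\mathbb{E}\big[\|\boldsymbol{\nu}_{i,u}(t)\|^2 \mid \mathcal{F}_t\big] \le \frac{c_1}{J_t^2} + \frac{c_2 \|\mathbf{Q}_t\|^2_\infty}{J_t^2} \le \frac{c_1}{J_0^2} + \frac{c_4 M_t^2}{J_t^2} \le \frac{c_1}{J_0^2} + c_4 (1+\varepsilon)^2 \le c_5, \tag{33}$$

where $c_4$ and $c_5$ are positive constants and we use (30) to obtain the penultimate inequality. Clearly, the
construction (32)-(33) falls under the purview of Corollary 4.1 and we conclude that there exists an a.s.
finite time $t_\varepsilon$, such that

$$\|\hat{\mathbf{z}}_{i,u}(t : t_0)\| \le \varepsilon \tag{34}$$

for all $t_\varepsilon \le t_0 \le t$ and state-action pairs $(i,u)$.

The hypothesis that $M_t \to \infty$ on the event $B$ implies, by (30), that $J_t \to \infty$ as $t \to \infty$ on $B$. Hence,
by (31), we may conclude that on $B$ the inequality $M_t \le J_t$ holds infinitely often. Together with the
construction in (32)-(34), the above establishes the existence of an a.s. finite (random) time $t_1$, such that,
on the event $B$, $M_{t_1} \le J_{t_1}$ and

$$\|\hat{\mathbf{z}}_{i,u}(t : t_1)\| \le \varepsilon$$

for all $t \ge t_1$ and state-action pairs $(i,u)$.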

To obtain a contradiction, we now show that, under the hypothesis $M_t \to \infty$ as $t \to \infty$ on $B$, the
following set of inequalities hold a.s. on $B$ for all state-action pairs $(i,u)$ and $t \ge t_1$:

$$-J_{t_1}(1+\varepsilon)\,\mathbf{1}_N \le_c J_{t_1}\big(\hat{\mathbf{z}}_{i,u}(t : t_1) - \mathbf{1}_N\big) \le_c \mathbf{Q}_{i,u}(t) \le_c J_{t_1}\big(\hat{\mathbf{z}}_{i,u}(t : t_1) + \mathbf{1}_N\big) \le_c J_{t_1}(1+\varepsilon)\,\mathbf{1}_N, \tag{35}$$

and

$$J_t = J_{t_1}. \tag{36}$$

Before deriving the above, we note that (35)-(36) would imply that

$$\limsup_{t \to \infty} M_t \le (1+\varepsilon) J_{t_1} < \infty \tag{37}$$

a.s. on $B$, thus contradicting the hypothesis that $M_t \to \infty$ a.s. on the event $B$ of positive measure. Hence,
to establish Lemma 5.1 it suffices to obtain (35)-(36), which is pursued in the following.



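We proceed by induction to establish (35)-(36). Note that the claim holds trivially for $t = t_1$ as, by
construction, $\hat{\mathbf{z}}_{i,u}(t_1 : t_1) = 0$ and $\|\mathbf{Q}_{i,u}(t_1)\|_\infty \le M_{t_1} \le J_{t_1}$ for all state-action pairs $(i,u)$. Assume
that (35)-(36) holds for all $s \in \{t_1, \dots, t\}$. To obtain (35)-(36) for the $(t+1)$-th instant, we note that,
under the induction hypothesis and by the order-preserving property in Proposition 4.2, we have

$$\begin{aligned}
(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{Q}_{i,u}(t) &\le_c (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\big(J_{t_1}\hat{\mathbf{z}}_{i,u}(t : t_1) + J_{t_1}\mathbf{1}_N\big) \\
&= J_{t_1}(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\hat{\mathbf{z}}_{i,u}(t : t_1) + (1 - \alpha_{i,u}(t))\,J_{t_1}\mathbf{1}_N,
\end{aligned}$$

where we also use the property of the Laplacian that $L_t \mathbf{1}_N = 0$. From (29) and (37), and the
induction hypothesis we obtain

$$\begin{aligned}
\mathbf{Q}_{i,u}(t+1) &= (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{Q}_{i,u}(t) + \alpha_{i,u}(t)\big(\boldsymbol{\mathcal{G}}_{i,u}(\mathbf{Q}_t) + \boldsymbol{\nu}_{i,u}(t)\big) \\
&\le_c J_{t_1}(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\hat{\mathbf{z}}_{i,u}(t : t_1) + (1 - \alpha_{i,u}(t))\,J_{t_1}\mathbf{1}_N + \alpha_{i,u}(t)\big(\boldsymbol{\mathcal{G}}_{i,u}(\mathbf{Q}_t) + \boldsymbol{\nu}_{i,u}(t)\big) \\
&\le_c J_{t_1}(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\hat{\mathbf{z}}_{i,u}(t : t_1) + (1 - \alpha_{i,u}(t))\,J_{t_1}\mathbf{1}_N \\
&\qquad + \alpha_{i,u}(t)\,\bar{\gamma}(1+\varepsilon)\,J_{t_1}\mathbf{1}_N + \alpha_{i,u}(t)\,J_{t_1}\hat{\boldsymbol{\nu}}_{i,u}(t) \\
&= J_{t_1}\mathbf{1}_N + J_{t_1}\big[(I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\hat{\mathbf{z}}_{i,u}(t : t_1) + \alpha_{i,u}(t)\,\hat{\boldsymbol{\nu}}_{i,u}(t)\big] \\
&= J_{t_1}\mathbf{1}_N + J_{t_1}\hat{\mathbf{z}}_{i,u}(t+1 : t_1) = J_{t_1}\big(\hat{\mathbf{z}}_{i,u}(t+1 : t_1) + \mathbf{1}_N\big),
\end{aligned}$$

where $\bar{\gamma}(1+\varepsilon) = 1$ by the choice of $\varepsilon$ following (29). This establishes the upper bound in (35) at $t+1$.
The lower bound can be obtained similarly by invoking
the order-preserving property and the induction hypothesis in the reverse direction. Finally, to obtain (36)
at $t+1$, we note that the satisfaction of (35) at $t+1$ implies, by the induction hypothesis,

$$M_{t+1} = \max_{s \le t+1} \|\mathbf{Q}_s\|_\infty \le (1+\varepsilon)\,J_{t_1} = (1+\varepsilon)\,J_t,$$

and, hence, by definition, $J_{t+1} = J_t = J_{t_1}$. This establishes the desired set of inequalities (35)-(36) for
all $t \ge t_1$, and Lemma 5.1 follows by the contradiction argument stated above (37).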



B. QD-learning: Asymptotic Consensus 

In this section, we establish the asymptotic agreement in agent updates. Recall, for each $n$, $\{Q^n_t\}$ to
be the $\{\mathcal{F}_t\}$ adapted update sequence at agent $n$ (see (10)). Denote by $\{\bar{Q}_t\}$ the network-averaged iterate
process, i.e.,

$$\bar{Q}_t = (1/N) \sum_{n=1}^{N} Q^n_t, \quad \forall t. \tag{38}$$

The goal of the section is to show that the local agent iterates eventually merge to the network-averaged
behavior. Specifically, we will establish the following:

Lemma 5.2 The agents reach consensus asymptotically, i.e., for each $n$,

$$\mathbb{P}\Big(\lim_{t \to \infty} \|Q^n_t - \bar{Q}_t\| = 0\Big) = 1.$$

Proof: Recall, by (27), for each state-action pair $(i,u)$ the process $\{\mathbf{Q}_{i,u}(t)\}$ evolves as

$$\mathbf{Q}_{i,u}(t+1) = (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{Q}_{i,u}(t) + \alpha_{i,u}(t)\big(\boldsymbol{\mathcal{G}}_{i,u}(\mathbf{Q}_t) + \boldsymbol{\nu}_{i,u}(t)\big),$$

which, by (10), may be rewritten as

$$\mathbf{Q}_{i,u}(t+1) = (I_N - \beta_{i,u}(t)L_t - \alpha_{i,u}(t)I_N)\,\mathbf{Q}_{i,u}(t) + \alpha_{i,u}(t)\big(\mathbf{U}(t) + \mathbf{J}(t)\big), \tag{39}$$

where $\{\mathbf{U}(t)\}$ and $\{\mathbf{J}(t)\}$ are $\mathbb{R}^N$-valued processes whose $n$-th components are given by

$$U_n(t) = \gamma \min_{v \in \mathcal{U}} Q^n_{x_{t+1},v}(t) \quad \text{and} \quad J_n(t) = c_n(x_t,u_t),$$

respectively.

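For each $k \ge 0$ denote by $\mathcal{H}_k$ the $\sigma$-algebra associated with the stopping time $T_{i,u}(k)$, i.e.,
$\mathcal{H}_k = \mathcal{F}_{T_{i,u}(k)}$. By $\{\mathbf{z}_k\}$ denote the randomly sampled version of $\{\mathbf{Q}_{i,u}(t)\}$, i.e., for each $k$

$$\mathbf{z}_k = \mathbf{Q}_{i,u}(T_{i,u}(k)),$$

and note that the process $\{\mathbf{z}_k\}$ is $\{\mathcal{H}_k\}$ adapted. Noting that the process $\{\mathbf{Q}_{i,u}(t)\}$ may only change at
the stopping times $T_{i,u}(k)$, the process $\{\mathbf{z}_k\}$ evolves as

$$\mathbf{z}_{k+1} = (I_N - \beta_k L_k - \alpha_k I_N)\,\mathbf{z}_k + \alpha_k\big(\mathbf{U}(k) + \mathbf{J}(k)\big), \tag{40}$$

where, by (M.4), we have

$$\beta_k = \beta_{i,u}(T_{i,u}(k)) = b/(k+1)^{\tau_2},$$

$$\alpha_k = \alpha_{i,u}(T_{i,u}(k)) = a/(k+1)^{\tau_1}$$

for all $k \ge 0$. Finally, denoting by $L_k$ and $\mathbf{J}(k)$ the quantities $L_{T_{i,u}(k)}$ and $\mathbf{J}(T_{i,u}(k))$ respectively,
by (M.1)-(M.2) we conclude that the processes $\{L_k\}$ and $\{\mathbf{J}(k)\}$ are $\{\mathcal{H}_{k+1}\}$ adapted with $L_k$ and $\mathbf{J}(k)$
being independent of $\mathcal{H}_k$ for each $k$. Further, for each $k$, $\mathbb{E}[L_k \mid \mathcal{H}_k] = \bar{L}$ and the i.i.d. process $\{\mathbf{J}(k)\}$
satisfies the moment condition

$$\mathbb{E}\big[\|\mathbf{J}(k)\|^{2+\varepsilon_1}\big] < \infty, \tag{41}$$

for a constant $\varepsilon_1 > 0$ (see (6)).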

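Let $\bar{z}_k = (1/N)\mathbf{1}_N^T \mathbf{z}_k$ denote the average of the components of $\mathbf{z}_k$. Using standard properties of the
Laplacian $L_k$ and (40), it follows that the residual $\hat{\mathbf{z}}_k = \mathbf{z}_k - \bar{z}_k \mathbf{1}_N$ evolves as

$$\hat{\mathbf{z}}_{k+1} = (I_N - \beta_k L_k - \alpha_k I_N)\,\hat{\mathbf{z}}_k + \alpha_k\big(\hat{\mathbf{U}}_k + \hat{\mathbf{J}}_k\big), \tag{42}$$

where

$$\hat{\mathbf{U}}_k = \big(I_N - (1/N)\mathbf{1}_N\mathbf{1}_N^T\big)\,\mathbf{U}(k) \quad \text{and} \quad \hat{\mathbf{J}}_k = \big(I_N - (1/N)\mathbf{1}_N\mathbf{1}_N^T\big)\,\mathbf{J}(k)$$

for all $k$. Noting that, by construction, $\hat{\mathbf{z}}_k \in \mathcal{C}^\perp$ (see Definition 4.1) for all $k$, and hence, by Lemma 4.2,
there exists a measurable $\{\mathcal{H}_{k+1}\}$ adapted $\mathbb{R}_+$ valued process $\{r_k\}$ and a constant $c_1 > 0$, such that
$0 \le r_k \le 1$ a.s. and

$$\big\|(I_N - \beta_k L_k - \alpha_k I_N)\,\hat{\mathbf{z}}_k\big\| \le \big\|(I_N - \beta_k L_k)\,\hat{\mathbf{z}}_k\big\| + \alpha_k \|\hat{\mathbf{z}}_k\| \le (1 - r_k)\,\|\hat{\mathbf{z}}_k\| + \alpha_k \|\hat{\mathbf{z}}_k\|$$

with

$$\mathbb{E}[r_k \mid \mathcal{H}_k] \ge \frac{c_1}{(k+1)^{\tau_2}}$$

for all $k$. Since $\tau_2 < \tau_1$ (see Assumption (M.5)), there exists $k_0$ (deterministic) and another constant
$c_2 \in (0,1)$, such that

$$\big\|(I_N - \beta_k L_k - \alpha_k I_N)\,\hat{\mathbf{z}}_k\big\| \le (1 - c_2 r_k)\,\|\hat{\mathbf{z}}_k\| \tag{43}$$

for $k \ge k_0$. By (42) and (43) we obtain for all $k \ge k_0$

$$\|\hat{\mathbf{z}}_{k+1}\| \le (1 - c_2 r_k)\,\|\hat{\mathbf{z}}_k\| + \alpha_k\big(\|\hat{\mathbf{U}}_k\| + \|\hat{\mathbf{J}}_k\|\big). \tag{44}$$

Note that the process $\{\|\hat{\mathbf{U}}_k\|\}$ is pathwise bounded by Lemma 5.1 and $\{\|\hat{\mathbf{J}}_k\|\}$ is i.i.d. satisfying the
moment condition in (41). Hence, the update in (44) falls under the purview of Lemma 4.1 and we
conclude that $(k+1)^{\tau}\hat{\mathbf{z}}_k \to 0$ as $k \to \infty$ a.s. for all $\tau \in (0, \tau_1 - \tau_2 - 1/(2+\varepsilon_1))$. In particular, $\hat{\mathbf{z}}_k \to 0$
as $k \to \infty$ a.s., and, since $\{\mathbf{Q}_{i,u}(t)\}$ is a piecewise constant interpolation of $\{\mathbf{z}_k\}$, we obtain

$$\mathbb{P}\Big(\lim_{t \to \infty} \big|Q^n_{i,u}(t) - \bar{Q}_{i,u}(t)\big| = 0\Big) = 1,$$

for each agent $n$, with $\bar{Q}_{i,u}(t) = (1/N)\mathbf{1}_N^T \mathbf{Q}_{i,u}(t)$ denoting the component-wise average of $\mathbf{Q}_{i,u}(t)$.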
Since the above can be shown for each state-action pair $(i,u)$, the assertion follows. ■

C. QD-learning: Averaged Dynamics

This section investigates the asymptotics of the network-averaged iterate $\{\bar{Q}_t\}$ (see (38)). Since the
agents reach consensus asymptotically (Lemma 5.2), it suffices to establish the convergence of $\{\bar{Q}_t\}$ in
order to obtain the main result of this paper.

To this end, consider the (centralized) Q-learning operator $\bar{\mathcal{G}} : \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|} \mapsto \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$, whose $(i,u)$-th
component $\bar{\mathcal{G}}_{i,u} : \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|} \mapsto \mathbb{R}$ is given by

$$\bar{\mathcal{G}}_{i,u}(Q) = (1/N) \sum_{n=1}^{N} \mathbb{E}[c_n(i,u)] + \gamma \sum_{j \in \mathcal{X}} p^u_{i,j} \min_{v \in \mathcal{U}} Q_{j,v},$$

for all $Q \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$. Note that, informally, $\bar{\mathcal{G}}(\cdot)$ is the average of the local QD-learning operators, i.e., for
each $Q \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}$ and state-action pair $(i,u)$, we have

$$\bar{\mathcal{G}}_{i,u}(Q) = (1/N) \sum_{n=1}^{N} \mathcal{G}^n_{i,u}(Q), \tag{45}$$

where the operators $\mathcal{G}^n$ are defined in (14). It is readily seen that the following assertion holds:

Proposition 5.1 The (centralized) Q-learning operator $\bar{\mathcal{G}}$ is a contraction. Specifically, we have

$$\|\bar{\mathcal{G}}(Q) - \bar{\mathcal{G}}(Q')\|_\infty \le \gamma\,\|Q - Q'\|_\infty, \quad \forall Q, Q' \in \mathbb{R}^{|\mathcal{X}\times\mathcal{U}|}. \tag{46}$$

Also, denoting by $Q^*$ the unique fixed point of $\bar{\mathcal{G}}(\cdot)$, we note that $\min_{u \in \mathcal{U}} Q^*_{i,u} = V^*_i$ for each $i \in \mathcal{X}$, where
$V^*_i$ denotes the $i$-th component of the unique fixed point $V^*$ of the (centralized) dynamic programming
operator $\mathcal{T}(\cdot)$.
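Because the centralized operator is a $\gamma$-contraction, its fixed point $Q^*$ may be approximated by simple fixed-point iteration when the model is known. The sketch below is a hedged illustration and not part of the QD-learning algorithm itself (which is model-free): the function names `q_operator` and `fixed_point`, the array layout `P[u, i, j]` for $p^u_{i,j}$, and `mean_cost[i, u]` for the network-averaged expected cost are assumptions made for this example.

```python
import numpy as np


def q_operator(Q, mean_cost, P, gamma):
    """Apply the centralized Q-learning operator to Q of shape (M, A).

    mean_cost : (M, A) array, network-averaged expected one-stage costs.
    P         : (A, M, M) array, P[u, i, j] = probability of i -> j under u.
    """
    v = Q.min(axis=1)  # v[j] = min over actions of Q[j, .]
    return mean_cost + gamma * np.einsum('uij,j->iu', P, v)


def fixed_point(mean_cost, P, gamma, tol=1e-10, max_iter=10000):
    """Iterate the operator from zero until successive iterates agree."""
    Q = np.zeros_like(mean_cost)
    for _ in range(max_iter):
        Q_next = q_operator(Q, mean_cost, P, gamma)
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
    return Q
```

Given the returned array `Q_star`, the value function estimate is `Q_star.min(axis=1)`, which corresponds to the relation $\min_{u} Q^*_{i,u} = V^*_i$ stated in Proposition 5.1.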



The convergence of the network-averaged iterate process $\{\bar{Q}_t\}$ (see (38)) will be established in this
section as follows:

Lemma 5.3 Under (M.1)-(M.5) we have

$$\mathbb{P}\Big(\lim_{t \to \infty} \|\bar{Q}_t - Q^*\| = 0\Big) = 1,$$

where $Q^*$ is the fixed point of $\bar{\mathcal{G}}(\cdot)$ (see Proposition 5.1).

The following result will be used in the proof of Lemma 5.3.

Lemma 5.4 For each state-action pair $(i,u)$, let $\{z_{i,u}(t)\}$ denote the $\{\mathcal{F}_t\}$ adapted real-valued process
evolving as

$$z_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,z_{i,u}(t) + \alpha_{i,u}(t)\big(\bar{\nu}_{i,u}(t) + \bar{\varepsilon}_{i,u}(t)\big),$$

where the weight sequence $\{\alpha_{i,u}(t)\}$ is given by (12) and $\{\bar{\nu}_{i,u}(t)\}$ is an $\{\mathcal{F}_{t+1}\}$ adapted process satisfying
$\mathbb{E}[\bar{\nu}_{i,u}(t) \mid \mathcal{F}_t] = 0$ for all $t$ and

$$\mathbb{P}\Big(\sup_{t \ge 0} \mathbb{E}\big[\bar{\nu}^2_{i,u}(t) \mid \mathcal{F}_t\big] < \infty\Big) = 1. \tag{47}$$

Further, the process $\{\bar{\varepsilon}_{i,u}(t)\}$ is $\{\mathcal{F}_t\}$ adapted, such that $\bar{\varepsilon}_{i,u}(t) \to 0$ as $t \to \infty$ a.s. Then, under (M.4)-(M.5),
we have $z_{i,u}(t) \to 0$ as $t \to \infty$ a.s.



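Proof: Consider the auxiliary process $\{y_{i,u}(t)\}$ with

$$y_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,y_{i,u}(t) + \alpha_{i,u}(t)\,\bar{\varepsilon}_{i,u}(t) \tag{48}$$

for all $t$. Note that $\bar{\varepsilon}_{i,u}(t) \to 0$ as $t \to \infty$ a.s. and, hence, for every $\delta > 0$, there exists $t_\delta$ (random), such
that $|\bar{\varepsilon}_{i,u}(t)| \le \delta$ for all $t \ge t_\delta$. Hence, by (48) for $t \ge t_\delta$, we obtain

$$|y_{i,u}(t+1)| \le (1 - \alpha_{i,u}(t))\,|y_{i,u}(t)| + \alpha_{i,u}(t)\,\delta,$$

which, by a pathwise application of Proposition 4.1, leads to

$$\limsup_{t \to \infty} |y_{i,u}(t)| \le \delta. \tag{49}$$

Since $\delta > 0$ is arbitrary in (49), we conclude that $y_{i,u}(t) \to 0$ as $t \to \infty$ a.s. Now, let us denote by
$\{\hat{z}_{i,u}(t)\}$ the $\{\mathcal{F}_t\}$ adapted process that satisfies $\hat{z}_{i,u}(t) = z_{i,u}(t) - y_{i,u}(t)$ for all $t$. Then,

$$\hat{z}_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,\hat{z}_{i,u}(t) + \alpha_{i,u}(t)\,\bar{\nu}_{i,u}(t)$$

for all $t$. The hypothesis (47) and Egorov's theorem imply that, for every $\delta' > 0$, there exists a constant
$K_{\delta'} > 0$, such that

$$\mathbb{P}\Big(\sup_{t \ge 0} \mathbb{E}\big[\bar{\nu}^2_{i,u}(t) \mid \mathcal{F}_t\big] < K_{\delta'}\Big) > 1 - \delta'. \tag{50}$$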



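Let $\tau_{\delta'}$ denote

$$\tau_{\delta'} = \min\big\{t \ge 0 : \mathbb{E}\big[\bar{\nu}^2_{i,u}(t) \mid \mathcal{F}_t\big] \ge K_{\delta'}\big\}, \tag{51}$$

and note that it readily follows that $\tau_{\delta'}$ is a stopping time w.r.t. $\{\mathcal{F}_t\}$. Also, let $\{\bar{\nu}^{\delta'}_{i,u}(t)\}$ be
the $\{\mathcal{F}_{t+1}\}$ adapted process, such that $\bar{\nu}^{\delta'}_{i,u}(t) = \bar{\nu}_{i,u}(t)\,\mathbb{I}(t < \tau_{\delta'})$ for all $t$, and note that

$$\mathbb{E}\big[\bar{\nu}^{\delta'}_{i,u}(t) \mid \mathcal{F}_t\big] = \mathbb{I}(t < \tau_{\delta'})\,\mathbb{E}\big[\bar{\nu}_{i,u}(t) \mid \mathcal{F}_t\big] = 0, \tag{52}$$

and

$$\mathbb{E}\big[(\bar{\nu}^{\delta'}_{i,u}(t))^2 \mid \mathcal{F}_t\big] = \mathbb{I}(t < \tau_{\delta'})\,\mathbb{E}\big[\bar{\nu}^2_{i,u}(t) \mid \mathcal{F}_t\big] \le K_{\delta'}, \tag{53}$$

for all $t$, where the last inequality uses the definition of $\tau_{\delta'}$, see (51). Finally, introduce the $\{\mathcal{F}_t\}$ adapted
process $\{\hat{z}^{\delta'}_{i,u}(t)\}$ that evolves as

$$\hat{z}^{\delta'}_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,\hat{z}^{\delta'}_{i,u}(t) + \alpha_{i,u}(t)\,\bar{\nu}^{\delta'}_{i,u}(t), \quad \hat{z}^{\delta'}_{i,u}(0) = \hat{z}_{i,u}(0),$$

for all $t$, and note that by (50), we have

$$\mathbb{P}\Big(\sup_{t \ge 0} \big|\hat{z}^{\delta'}_{i,u}(t) - \hat{z}_{i,u}(t)\big| = 0\Big) > 1 - \delta'. \tag{54}$$

With (52)-(53), we note that the process $\{\hat{z}^{\delta'}_{i,u}(t)\}$ reduces to a scalar instantiation of the process in
the hypothesis of Lemma 4.3 (with $\beta_{i,u}(t)$ set to zero for all $t$, see also Remark 4.1), and we obtain
$\hat{z}^{\delta'}_{i,u}(t) \to 0$ as $t \to \infty$ a.s. Hence, by (54) we have

$$\mathbb{P}\Big(\lim_{t \to \infty} \hat{z}_{i,u}(t) = 0\Big) > 1 - \delta'.$$

Noting that $\delta' > 0$ above is arbitrary, we obtain $\hat{z}_{i,u}(t) \to 0$ as $t \to \infty$ a.s., which together with the fact
that $y_{i,u}(t) \to 0$ as $t \to \infty$ a.s. yields $z_{i,u}(t) \to 0$ as $t \to \infty$ a.s. ■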
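We now complete the proof of Lemma 5.3.

Proof of Lemma 5.3: Noting that $\mathbf{1}_N^T L_t = 0$ and by (39) and (45), we have for each state-action pair $(i,u)$

$$\bar{Q}_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,\bar{Q}_{i,u}(t) + \alpha_{i,u}(t)\big(\bar{\mathcal{G}}_{i,u}(\bar{Q}_t) + \bar{\nu}_{i,u}(t) + \bar{\varepsilon}_{i,u}(t)\big), \tag{55}$$

where $\{\bar{\nu}_{i,u}(t)\}$ and $\{\bar{\varepsilon}_{i,u}(t)\}$ are $\{\mathcal{F}_{t+1}\}$ and $\{\mathcal{F}_t\}$ adapted processes, respectively, such that
$\bar{\nu}_{i,u}(t) = (1/N)\mathbf{1}_N^T \boldsymbol{\nu}_{i,u}(t)$ and

$$\bar{\varepsilon}_{i,u}(t) = (1/N) \sum_{n=1}^{N} \big(\mathcal{G}^n_{i,u}(Q^n_t) - \mathcal{G}^n_{i,u}(\bar{Q}_t)\big)$$

for all $t$. Note that $\mathbb{E}[\bar{\nu}_{i,u}(t) \mid \mathcal{F}_t] = 0$ and the boundedness of the iterate process $\{\mathbf{Q}_t\}$ (Lemma 5.1)
implies

$$\mathbb{P}\Big(\sup_{t \ge 0} \mathbb{E}\big[\bar{\nu}^2_{i,u}(t) \mid \mathcal{F}_t\big] < \infty\Big) = 1.$$

Observing that the functionals $\mathcal{G}^n_{i,u}(\cdot)$ are Lipschitz, we have

$$|\bar{\varepsilon}_{i,u}(t)| \le (\gamma/N) \sum_{n=1}^{N} \|Q^n_t - \bar{Q}_t\|_\infty$$

for all $t$, and, hence, by Lemma 5.2 we conclude that $\bar{\varepsilon}_{i,u}(t) \to 0$ as $t \to \infty$ a.s.

Now consider the auxiliary $\{\mathcal{F}_t\}$ adapted process $\{z_{i,u}(t)\}$ for each state-action pair $(i,u)$, such that

$$z_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,z_{i,u}(t) + \alpha_{i,u}(t)\big(\bar{\nu}_{i,u}(t) + \bar{\varepsilon}_{i,u}(t)\big) \tag{56}$$

for all $t$. Based on the above discussion on the properties of the processes $\{\bar{\nu}_{i,u}(t)\}$ and $\{\bar{\varepsilon}_{i,u}(t)\}$ and
Lemma 5.4 we conclude that the process $\{z_{i,u}(t)\}$, so constructed, satisfies $z_{i,u}(t) \to 0$ as $t \to \infty$ a.s.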



By Lemma 5.1 the process $\{\bar{Q}_t\}$ is bounded and hence there exists an a.s. finite random variable $R$,
such that

$$R = \limsup_{t \to \infty} \|\bar{Q}_t - Q^*\|_\infty. \tag{57}$$

Assume, on the contrary, that $R = 0$ a.s. does not hold. Then there exists an event $B$ of positive measure such that $R > 0$
on $B$. To derive a contradiction, let $\delta > 0$ be a constant, such that $\gamma(1+\delta) < 1$, and consider the process
$\{\hat{Q}_{i,u}(t)\}$, for each state-action pair $(i,u)$, such that $\hat{Q}_{i,u}(t) = \bar{Q}_{i,u}(t) - z_{i,u}(t) - Q^*_{i,u}$ for all $t$. Noting
that $Q^*$ is a fixed point of the operator $\bar{\mathcal{G}}(\cdot)$, we have using (55) and (56)

$$\hat{Q}_{i,u}(t+1) = (1 - \alpha_{i,u}(t))\,\hat{Q}_{i,u}(t) + \alpha_{i,u}(t)\big(\bar{\mathcal{G}}_{i,u}(\bar{Q}_t) - \bar{\mathcal{G}}_{i,u}(Q^*)\big) \tag{58}$$

for all $t$. By (57), there exists $t_\delta$ (random), such that

$$\|\bar{Q}_t - Q^*\|_\infty \le R(1+\delta)$$

on $B$ a.s. for all $t \ge t_\delta$. Thus, by (58),

$$|\hat{Q}_{i,u}(t+1)| \le (1 - \alpha_{i,u}(t))\,|\hat{Q}_{i,u}(t)| + \alpha_{i,u}(t)\,\gamma(1+\delta)R \tag{59}$$

on $B$ a.s. for all $t \ge t_\delta$, where we use the fact that, for each $(i,u)$, the functional $\bar{\mathcal{G}}_{i,u}(\cdot)$ is a contraction
with coefficient $\gamma$, see (46). A pathwise application of Proposition 4.1 on (59) then yields

$$\mathbb{P}\Big(\limsup_{t \to \infty} |\hat{Q}_{i,u}(t)| \le \gamma(1+\delta)R\Big) \ge \mathbb{P}(B) > 0.$$

Since the above holds for each state-action pair $(i,u)$, $z_{i,u}(t) \to 0$ a.s., and $\gamma(1+\delta) < 1$, we conclude that

$$\limsup_{t \to \infty} \|\bar{Q}_t - Q^*\|_\infty < R \quad \text{a.s. on } B.$$

Since $B$ has positive measure, the above contradicts the hypothesis (57) and we conclude that $R = 0$
a.s. This completes the proof. ■



Proof of Theorem 3.1: The first part of Theorem 3.1 follows from the fact that, for each $n$, $Q^n_t \to Q^*$
a.s. as $t \to \infty$ by Lemma 5.2 and Lemma 5.3. The second part is an immediate consequence of the
characterization of the limiting consensus value $Q^*$ achieved in Proposition 5.1.



6. Simulation Studies 

Fig. 1. Left: Centralized (dotted lines) and distributed QD (at a randomly uniformly selected agent, solid lines) Q-factors.
Right: Consensus among distributed Q-factors.

In this section we simulate the convergence rate behavior of the proposed QD-learning for an example
setup. The setup consists of a network of $N = 40$ agents with binary-valued state and action spaces, i.e.,
the cardinality of the state-action space $\mathcal{X} \times \mathcal{U}$ is 4. Thus, in all there are 8 controlled transition parameters
$p^u_{i,j}$, $i,j \in \mathcal{X}$ and $u \in \mathcal{U}$; 4 of these probabilities were chosen independently by uniformly sampling the
interval $[0, 1]$, which also fixes the values of the remaining 4. For each $n$ and state-action pair $(i,u)$, the
random one-stage cost $c_n(i,u)$ is assumed to follow a Gaussian distribution with variance 40, the mean
(or expected one-stage cost) $\mathbb{E}[c_n(i,u)]$ being another random sample (generated independently for each $n$
and state-action pair $(i,u)$) from the uniform distribution on $[0, 400]$. The discounting factor was taken to
be $\gamma = 0.7$. For the purpose of simulating the distributed QD scheme, we considered a 2-nearest neighbor
inter-agent communication topology with a 0.5 probability of link erasure (for all network links), i.e., the
resulting communication network may be viewed as the $N$ agents being symmetrically placed on a circle
with each agent exchanging information with its two neighbors on either side. The performance of both
the distributed QD-learning and centralized Q-learning was simulated on a single (random) state-action
trajectory $\{x_t, u_t\}$ instantiation; specifically, the state-action trajectory was generated by independently
uniformly sampling control actions from $\mathcal{U}$ over time, whereas the state trajectory $\{x_t\}$ instantiation was
generated by sampling $x_{t+1}$ from the probability distribution $p^{u_t}_{x_t,\cdot}$ independently of the past at each time
$t$. Fig. 1 illustrates the typical sample path convergence behavior of the distributed QD scheme and the
centralized Q-learning, in which the distributed Q-factors, $Q^n_{i,u}(t)$, corresponding to QD-learning were
obtained using the recursions (10)-(12), whereas the centralized Q-factors, $Q^c_{i,u}(t)$, were generated using
the centralized Q-learning recursions

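$$Q^c_{i,u}(t+1) = Q^c_{i,u}(t) + \alpha_{i,u}(t)\left(\frac{1}{N}\sum_{n=1}^{N} c_n(x_t, u_t) + \gamma \min_{v \in \mathcal{U}} Q^c_{x_{t+1},v}(t) - Q^c_{i,u}(t)\right) \tag{60}$$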

for each state-action pair $(i,u) \in \mathcal{X} \times \mathcal{U}$. The exponent $\tau_2$ (see (M.5)) corresponding to the consensus
weight sequence (in the distributed QD) was set to 0.2, whereas for the innovation weight sequence
$\{\alpha_{i,u}(t)\}$ (for both the QD and centralized Q-learning (60)) the exponent $\tau_1$ was taken to be 1. In Fig. 1
(on the left) we compare the evolution of the centralized Q-factors with that of the distributed Q-factors generated
by the QD recursions at a randomly selected agent, i.e., for each pair $(i,u) \in \mathcal{X} \times \mathcal{U}$ (with a total of
4 such pairs), we plot the trajectories $\{Q^c_{i,u}(t)\}$ (depicted by dotted lines) and $\{Q^n_{i,u}(t)\}$, at a randomly
uniformly selected agent $n$ (depicted by solid lines), corresponding to the centralized and distributed
schemes, respectively; whereas Fig. 1 (on the right) illustrates the evolution of the Q-factors at 10 randomly
(uniformly) selected network agents (for the distributed QD), verifying that they reach consensus on
each state-action pair $(i,u)$.
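
A minimal sketch of the centralized part of this experiment, under stated assumptions, might look as follows. It draws the transition probabilities and expected costs as described, generates the trajectory by uniform action sampling, and applies an update of the form (60). The helper names (`make_setup`, `network_average_cost`, `bellman`, `centralized_q_learning`), the seeds, and the step size $1/k$ (with $k$ the visit count of the current pair, standing in for an innovation weight with exponent 1) are assumptions for illustration and are not taken from the paper.

```python
import random

GAMMA = 0.7  # discounting factor from the simulation setup


def make_setup(num_agents=40, seed=0):
    """Draw transition probabilities and expected one-stage costs at random."""
    rng = random.Random(seed)
    # p[u][i] = [P(next=0 | i, u), P(next=1 | i, u)]: one free parameter per (i, u).
    p = []
    for _u in range(2):
        rows = []
        for _i in range(2):
            q0 = rng.uniform(0.0, 1.0)
            rows.append([q0, 1.0 - q0])
        p.append(rows)
    # mean_cost[n][i][u]: expected cost of agent n, uniform on [0, 400].
    mean_cost = [[[rng.uniform(0.0, 400.0) for _u in range(2)] for _i in range(2)]
                 for _n in range(num_agents)]
    return p, mean_cost


def network_average_cost(mean_cost):
    """Average the expected costs over agents for each pair (i, u)."""
    n = len(mean_cost)
    return [[sum(mean_cost[k][i][u] for k in range(n)) / n for u in range(2)]
            for i in range(2)]


def bellman(q, p, cbar, gamma=GAMMA):
    """One application of the Bellman operator for the network-averaged cost."""
    return [[cbar[i][u] + gamma * sum(p[u][i][j] * min(q[j]) for j in range(2))
             for u in range(2)] for i in range(2)]


def centralized_q_learning(p, mean_cost, steps=20000, gamma=GAMMA,
                           cost_var=40.0, seed=1):
    """Q-learning on the network-averaged noisy cost along one trajectory."""
    rng = random.Random(seed)
    n_agents = len(mean_cost)
    sd = cost_var ** 0.5
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[i][u]
    visits = [[0, 0], [0, 0]]
    x = rng.randrange(2)
    for _ in range(steps):
        u = rng.randrange(2)  # actions sampled uniformly over time
        x_next = 0 if rng.random() < p[u][x][0] else 1
        avg_cost = sum(rng.gauss(mean_cost[n][x][u], sd)
                       for n in range(n_agents)) / n_agents
        visits[x][u] += 1
        alpha = 1.0 / visits[x][u]  # assumed step size; only the visited pair moves
        q[x][u] += alpha * (avg_cost + gamma * min(q[x_next]) - q[x][u])
        x = x_next
    return q
```

Iterating `bellman` on `network_average_cost(mean_cost)` from any starting point gives a reference fixed point against which the learned factors returned by `centralized_q_learning` can be compared.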

From Fig. 1 we readily infer that the convergence rate of QD is reasonably close to that of centralized
Q-learning; more importantly, Fig. 1 demonstrates that the per-step convergence factor (i.e., the
improvement over successive time steps) of QD approaches that of the centralized Q-learning asymptotically (in
the limit of large time $t$). While the above inference is drawn from the low-dimensional (4 state-action
pairs) simulation setup considered (and while the absolute convergence rates of both the centralized
and distributed schemes will decline with increasing dimension of the state-action space), we expect the same
relative convergence rate trend (i.e., the asymptotic equivalence of the distributed and centralized
per-step convergence factors) to hold for more general (higher dimensional) setups. Intuitively, the negligible
asymptotic convergence rate loss of the distributed scheme with respect to the centralized one is attributed to the
asymptotic domination of the consensus potential over the innovations (recall $\beta_{i,u}(t)/\alpha_{i,u}(t) \to \infty$ as
$t \to \infty$), which enables each agent to essentially track the network aggregated innovation instantaneously
in the limit of large time.
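
The tracking intuition can be sketched on a simplified scalar analogue. The code below is not the QD recursions (10)-(12): each agent merely estimates the network average of noisy local observations, using a consensus term with weight $\beta_t$ and an innovation term with weight $\alpha_t$ on the same circular topology with link erasures. The weight forms $\alpha_t = 1/(t+2)$ and $\beta_t = 0.1/(t+1)^{0.2}$, the function name `consensus_innovations`, and the seed are assumptions made for illustration.

```python
import random


def consensus_innovations(num_agents=40, steps=5000, erasure_prob=0.5,
                          noise_var=40.0, b=0.1, tau2=0.2, seed=1):
    """Scalar consensus + innovations averaging on a ring with random link erasures."""
    rng = random.Random(seed)
    # Local means, one per agent; the network goal is their average.
    means = [rng.uniform(0.0, 400.0) for _ in range(num_agents)]
    # Undirected ring: each agent is linked to its two neighbors on either side.
    edges = [(n, (n + d) % num_agents) for n in range(num_agents) for d in (1, 2)]
    sd = noise_var ** 0.5
    x = [0.0] * num_agents
    for t in range(steps):
        alpha = 1.0 / (t + 2)        # innovation weight (exponent 1), assumed form
        beta = b / (t + 1) ** tau2   # consensus weight (exponent 0.2), assumed form
        # Consensus potential: sum of (x_n - x_l) over links that survive erasure.
        pull = [0.0] * num_agents
        for n, l in edges:
            if rng.random() >= erasure_prob:
                diff = x[n] - x[l]
                pull[n] += diff
                pull[l] -= diff
        # Each agent combines neighbor disagreement with its own noisy observation.
        x = [x[n] - beta * pull[n] + alpha * (rng.gauss(means[n], sd) - x[n])
             for n in range(num_agents)]
    return means, x


means, x = consensus_innovations()
print("target average:", sum(means) / len(means))
print("estimate range:", min(x), max(x))
```

Because $\beta_t/\alpha_t$ grows with $t$ in this sketch, the disagreement between agents shrinks relative to the spread of their local means, while the network average of the estimates follows the running average of the observations.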

7. Conclusion 

The paper has investigated a distributed multi-agent reinforcement learning setup in a networked 
environment, in which the agents (for instance, temperature sensors in smart thermostatically controlled 
building applications, or, more generally, autonomous entities in social computing and decision making 
applications) respond differently to a global environmental signal or trend. Our setup is collaborative 
and non-competitive, with the overall network objective being global welfare, i.e., specifically, the 
network is interested in learning and evaluating the optimal stationary control strategy that minimizes 
the network-average infinite horizon discounted one-stage costs. Rather than considering a centralized 
solution methodology that requires each network agent to forward its instantaneous (random) one-stage 
cost to a remote centralized supervisor at all times, we have focused on a distributed approach in which 
the network agents engage in in-network processing (learning) by means of local communication and 
computation. The resulting distributed version of Q-learning, the QD scheme, has been shown to achieve
optimal learning performance asymptotically, i.e., the network agents reach consensus on the desired value
function and the corresponding optimal control strategy, under minimal connectivity assumptions on the
underlying communication graph. Similar to direct adaptive control formulations (see, for example, [15]),
we have allowed generic statistical dependence on the state-action trajectories (processes) that drive the 
learning, which, in turn, in our distributed setting leads to mixed time-scale stochastic evolutions that
are non-Markovian (see (10) and note that the state $x_t$ and control $u_t$ are general $\{\mathcal{F}_t\}$ processes).
The analysis methods developed in the paper are of independent interest and we expect our techniques 
to be applicable to broader classes of distributed information processing and control problems with 
memory. For a low dimensional (in the size of the state-action space) example, the simulations in
Section 6 indicate that the convergence rate of the proposed distributed QD scheme is reasonably close
to that of the centralized implementation. While in the same section it was argued that the per-step
convergence rate of the distributed scheme should asymptotically approach that of the centralized one in
more general scenarios (higher dimensional setups) due to the asymptotic domination of the consensus
potential over the innovations, an important future direction would consist of analytically characterizing
the convergence rate of QD-learning under further assumptions on the state-action generation, for instance,
by imposing specific statistical structure on the simulated state-action pairs, a commonly used approach
being simulating the system response by i.i.d. generation of state-action pairs [50]. In such cases, or more
generally, cases in which the convergence rate of centralized Q-learning may be characterized [51], it
would be interesting to see whether the proposed distributed QD-learning entails any loss of performance
(with respect to convergence rate) or not. Two other practically motivating and challenging future research
topics concern the partial state information case, in which the global state process may not be perfectly 
observable at the local agent level, and the distributed actuation case, in which, instead of a remote 
controller acting on the global signal, the agents are themselves responsible for local actuations. 